[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io)

# AI Program

## Exploratory Data Analysis

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.000 | 09/08/2025 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/0002PointLine.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Scientific Python

# Machine Learning
from sklearn.linear_model import LogisticRegression

# Miscellaneous
import difflib #<! Fuzzy text search
import os
from platform import python_version
import random
import re #<! Regular Expression

import onedrivedownloader

# Typing 
from typing import Callable

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# Jupyter
from IPython import get_ipython

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

```python
valToFill = ???
```

 - Multi Line to Fill (At least one)

 ```python
 # You need to start writing
 ?????
 ```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

?????
#===============================================================#
```

In [None]:
# Configuration
# %matplotlib inline

# warnings.filterwarnings("ignore")

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
sns.set_theme() #<! Apply SeaBorn theme
# sns.set_palette("tab10")

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2

In [None]:
# Course Packages


In [None]:
# Auxiliary Functions


## Exploratory Data Analysis (EDA)

The EDA is a stage which precedes the Machine Learning pipeline.  
It serves as an introduction to the data in order to plan an intelligent path to process it in light of the objective.

Common objectives of the EDA phase:
 - Detection of outliers and invalid data.
 - Validation of assumptions.
 - Preliminary selection of appropriate models.
 - Determining relationships among the features.
 - Determining relationships between features and the objective variable (Label).

This notebook presents some visualization tools and concepts through the analysis of known data sets.  
The visualization toolboxes used, beyond [Matplotlib](https://matplotlib.org/), are:

 - [SeaBorn](https://github.com/mwaskom/seaborn) - Statistical data visualization in Python.
 - [PlotLy for Python](https://github.com/plotly/plotly.py) - The interactive graphing library for Python.

</br>

* <font color='brown'>(**#**)</font> While the EDA is prior to the ML pipeline, the work is iterative and bidirectional.
* <font color='brown'>(**#**)</font> EDA is crucial in _Classic Machine Learning_ as it is the step to experiment with **Feature Engineering**.
* <font color='brown'>(**#**)</font> EDA does not require labels, though it should utilize them if available.

### Types of Data

```mermaid
mindmap
  root )Data Types in ML(
    Structured Data
        ((Tabular 📊))
          (Customer Data)
          (Employee Data)
        ((Structured File 📄))
          (Jupyter Notebook)
          (HTML File)
    Unstructured Data
        ((Text 📝))
          (Product Reviews)
          (Twitter Posts)
        ((Image 🖼️))
          (Chest X-Ray)
          (Satellite Multi Spectral Image)
          (Smartphone Image)
        ((Audio 🔊))
          (Speech Command)
          (Podcast Recording)
        ((Time Series ⏱️))
          (Hourly Energy Usage)
          (Stock Value)
        ((Geospatial 🗺️))
          (GPS Tracks)
          (Electric Poles Coordinates)
        ((Graph 🕸️))
          (Social Network)
          (Family Tree)
        ((Event / Log 🧾))
          (Clickstream)
          (Machine Log)
```

![](https://i.imgur.com/mknxPtI.png)
<!-- ![](https://i.postimg.cc/q7f6vMQd/mknxPtI.png) -->

### Types of Data Element

```mermaid
flowchart TD
A[Data Element Type in Machine Learning]

A --> B[Numeric]
B --> C["`Continuous<br/>(e.g., Daily temperature °C)`"]
B --> D["`Discrete<br/>(e.g., Number of patients)`"]

A --> F[Categorical]
F --> G["`Nominal<br/>(e.g., Color = {red, green, blue})`"]
F --> E["`Binary<br/>(e.g., Fraud: 0/1)`"]
F --> H["`Ordinal<br/>(e.g., Satisfaction = {poor, fair, good, excellent})`"]


%% Styling
classDef root fill:#7f7f7f,stroke:#404040,stroke-width:4px,color:#ffffff;
classDef num fill:#e3f2fd,stroke:#1976d2,stroke-width:4px,color:#0d47a1;
classDef cat fill:#fff3e0,stroke:#ef6c00,stroke-width:4px,color:#e65100;
classDef catnom fill:#fff3e0,stroke:#ef6c00,stroke-width:4px,stroke-dasharray: 5 5,color:#e65100;

class A root;
class B,C,D num;
class F,H cat;
class E,G catnom;
```

![](https://i.imgur.com/oO1Q22n.png)
<!-- ![](https://i.postimg.cc/432syK4L/oO1Q22n.png) -->
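
The difference between _Nominal_ and _Ordinal_ elements can be made explicit in Pandas. The following is a minimal sketch, assuming hypothetical values which mirror the examples in the diagram (the variable names are illustrative only):

```python
import pandas as pd

# Hypothetical values, mirroring the examples of the diagram
dsColor = pd.Series(['red', 'green', 'blue', 'green'], dtype = 'category') #<! Nominal: No order
lLevel  = ['poor', 'fair', 'good', 'excellent'] #<! The order of the levels
dsSat   = pd.Series(pd.Categorical(['good', 'poor', 'excellent', 'fair'], categories = lLevel, ordered = True)) #<! Ordinal

print(f'Is Color ordered: {dsColor.cat.ordered}')
print(f'Is Satisfaction ordered: {dsSat.cat.ordered}')
print(f'Range of Satisfaction: [{dsSat.min()}, {dsSat.max()}]') #<! Order aware operations are valid for ordinal data
print(f'At least good: {(dsSat >= "good").tolist()}') #<! Comparison by the order of the categories
```

Keeping the order in the data type lets sorting, comparisons and plot axes follow the semantic order rather than the alphabetical one.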

### Guidelines per Data Type  

There are guidelines related to the visualization of different data types:

 - Categorical: 
   - Single Variable: Bar Plot, Pie Plot, Word Cloud (Text).
   - Two Independent: Venn Diagram, 2D Bar Plot.
   - Adjacency: Graph, Sankey.
   - Sub Group (Data & Labels) / Nested: Scatter, Grouped / Stacked Bar Plot, Dendrogram.
 - Numeric
   - Single Variable: Histogram / Density.
   - Two Independent (Not Ordered): Box Plot, Violin Plot, Marginals Plot, Scatter Plot, 2D Histogram.
   - Two Independent (Ordered): Line Plot, Area Plot.
   - Several (Not Ordered): Box Plot, Violin Plot, Bubble Plot, Ridge Line, Dimensionality Reduction (PCA), Measure Heatmap (Correlation).
   - Several (Ordered): Stacked Line Plot, Stacked Area Plot.

Some of the concepts hold for the mixed case.

</br>

* <font color='brown'>(**#**)</font> A detailed guide matching a visualization to a data type is given in [From Data to Viz](https://www.data-to-viz.com).
* <font color='brown'>(**#**)</font> [Wikipedia - Box Plot](https://en.wikipedia.org/wiki/Box_plot).
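
The statistics drawn by a Box Plot can be computed directly, which also gives a simple rule for the detection of outliers. The following is a minimal sketch, assuming a hypothetical sample and the common 1.5 IQR whisker rule (Tukey's rule):

```python
import numpy as np

# Hypothetical sample with a single extreme value
vX = np.array([2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 7.0, 25.0])

valQ1, valMed, valQ3 = np.percentile(vX, [25, 50, 75]) #<! The box and the median line
valIqr = valQ3 - valQ1 #<! Inter Quartile Range

# Samples farther than 1.5 IQR from the box are flagged
lowFence  = valQ1 - 1.5 * valIqr
highFence = valQ3 + 1.5 * valIqr
vOutlier  = vX[(vX < lowFence) | (vX > highFence)]

print(f'Q1: {valQ1}, Median: {valMed}, Q3: {valQ3}')
print(f'Fences: [{lowFence}, {highFence}], Outliers: {vOutlier}')
```

The factor 1.5 is a convention and not a property of the data, hence flagged samples should be inspected before being removed.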


## EDA - The Diamonds Data Set

The dataset variables are:

 - _Carat_  
   Carat is a unit used to measure the weight of a diamond. One carat is equivalent to 200 mg. Diamond prices increase with the carat weight: the bigger the diamond, the higher the price. If the weights of two diamonds are equal, the other features determine the price.
 - _Cut_  
   The goal is to cut a diamond to an appropriate size, shape, and angle such that the light entering the diamond is reflected and leaves through the top surface.  
   The values are Ideal, Premium, Very Good, Good, Fair.  
   This feature is important as it determines three crucial properties:
    - Brilliance: The brightness of a diamond due to the reflection of white light inside and outside of the diamond.
    - Fire: The scattering of white light into all the colors of the rainbow.
    - Scintillation: The amount of sparkle produced and the pattern of light and dark areas caused by reflection within a diamond.
 - _Color_  
   The color grade of a diamond measures its lack of color. A colorless diamond, like a drop of water, has a high value, as only then it can scatter the light without absorbing it. However, some diamonds with distinct colors have higher prices.  
   The color scale runs over the letters D to Z, in ascending order of the amount of color present in the diamond. From K onward, up to Z, a yellowish color is visible.  
   The groups are: D, E, F: Colorless; G, H, I, J: Near Colorless; K, L, M: Faint Color; N-R: Very Light Color; S-Z: Light Color.
 - _Clarity_   
   Diamonds are formed by sheer pressure and heat below the ground. Therefore, a diamond contains some inclusions, i.e., marks or line patterns inside the diamond. Marks or lines on the outer layer of a diamond are called blemishes. Based on the amount of inclusions and blemishes, the clarity of a diamond is categorized as FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3. The categories are listed from the least to the greatest amount of inclusions and blemishes.
 - _Depth_ [%]  
   Depth is the distance from the top surface, i.e., the table, to the culet. The depth percentage is calculated by dividing the diamond depth by the overall width of the diamond. The lower the depth percentage, the bigger the diamond looks from below, i.e., the pavilion.
 - _Table_ [%]  
   The table is the topmost surface of a diamond and the most significant facet of a round diamond. An appropriate table width allows the light to enter and reflect in the appropriate direction. Otherwise, most of the light scatters off in different directions. The table percentage is calculated by dividing the table width by the overall diamond width.
 - _x_ / _y_ / _z_ [Millimeter]  
   The dimensions of a diamond are measured in millimeters. Moreover, the shape of a diamond is determined by the length to width (L/W) ratio. For instance, a round diamond has an L/W ratio between 1 and 1.05, while an oval diamond has an L/W ratio of around 1.50 or less.  
   `x` -> Length, `y` -> width, `z` -> depth.

For more information look at [Diamonds Data Set](https://raw.githubusercontent.com/rithwiksarma/EDA-and-Classification---Diamonds-Dataset/main/updated-Diamonds-Project.pdf).
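
The derived quantities described above are simple arithmetic. The following is a minimal sketch using hypothetical measurements (not taken from the data set). It assumes the overall width used by the depth percentage is the mean of `x` and `y`, which is an assumption of this sketch:

```python
# Hypothetical diamond measurements (Not taken from the data set)
valCarat = 0.5
valX, valY, valZ = 5.10, 5.14, 3.15 #<! Length, Width, Depth [Millimeter]

weightMg = valCarat * 200 #<! One carat is 200 [mg]
depthPct = 100 * valZ / ((valX + valY) / 2) #<! Assumption: Overall width is the mean of x and y
lwRatio  = max(valX, valY) / min(valX, valY) #<! Length to width ratio
isRound  = 1.0 <= lwRatio <= 1.05 #<! The round criterion described above

print(f'Weight: {weightMg:.0f} [mg], Depth: {depthPct:.1f} [%], L/W Ratio: {lwRatio:.3f}, Round: {isRound}')
```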

In [None]:
# Load the Data

# Data from CSV
# Pandas can read CSV data from **URL**'s and local files

diamondsCsvUrl  = r'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv'
dfDiamonds      = pd.read_csv(diamondsCsvUrl)
dfDiamonds #<! Displays a truncated view of the data frame (First and last rows)

In [None]:
# Rename the Columns

dColName = {'carat': 'Carat', 'cut': 'Cut', 'color': 'Color', 'clarity': 'Clarity', 'depth': 'Depth Ratio', 'table': 'Table Ratio', 'price': 'Price [$]', 'x': 'Length', 'y': 'Width', 'z': 'Depth'}

dfDiamonds.rename(columns = dColName, inplace = True)

print(f'The columns are given by: {dfDiamonds.columns}')

In [None]:
# Show Info of the Data Frame

print(f'The DF Shape is: {dfDiamonds.shape}')
print('The DF variables info:')
dfDiamonds.info() #<! Prints by itself and returns `None`, hence not embedded in an f-string

In [None]:
# The Type of Data
# Non numeric columns are treated as Categorical (Independent of the dtype Pandas uses for strings)
dVarType = {colName: 'Continuous' if pd.api.types.is_numeric_dtype(dfDiamonds[colName]) else 'Categorical' for colName in dfDiamonds.columns}

In [None]:
# The Values
# Each column is a series with the given methods in: https://pandas.pydata.org/docs/reference/series.html

for colName in dfDiamonds:
    varType = dVarType[colName]
    if varType == 'Categorical':
        print(f'The {colName} variable is {varType} with values: {dfDiamonds[colName].unique()}')
    else:
        print(f'The {colName} variable is {varType} with values: [{dfDiamonds[colName].min()}, {dfDiamonds[colName].max()}]')

In [None]:
# Pandas Describe

dfDiamonds.describe()

In [None]:
# Price Distribution

hF, hA = plt.subplots(figsize = (10, 6))
sns.histplot(dfDiamonds, x = 'Price [$]', kde = True, ax = hA)
hA.set_title('Price Distribution');

The price distribution is skewed and not "Normal Like".  
In many cases the _Log Transform_ generates more "Normal Like" data, which is better suited for Linear Estimators.  

* <font color='brown'>(**#**)</font> The _Log Transform_ reduces the skewness and is applicable to positive data only.  
  The underlying assumption is that the data follows a [Log Normal Distribution](https://en.wikipedia.org/wiki/Log-normal_distribution).
* <font color='brown'>(**#**)</font> Other alternatives are the _Square Root_ and _Box-Cox_ transformations.
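
The effect of the transformations on the skewness can be measured directly. The following is a minimal sketch on synthetic Log Normal samples (not the diamonds data), where `SampleSkew()` is a helper defined here for illustration:

```python
import numpy as np

def SampleSkew( vX: np.ndarray ) -> float:
    # The 3rd standardized moment of the samples
    vZ = (vX - np.mean(vX)) / np.std(vX)
    return float(np.mean(vZ ** 3))

oRng = np.random.default_rng(0)
vX   = oRng.lognormal(mean = 0.0, sigma = 1.0, size = 50_000) #<! Synthetic positive, right skewed data

skewRaw = SampleSkew(vX)          #<! Original data
skewSqr = SampleSkew(np.sqrt(vX)) #<! Square Root transform
skewLog = SampleSkew(np.log(vX))  #<! Log transform

print(f'Skewness: Raw = {skewRaw:.2f}, Square Root = {skewSqr:.2f}, Log = {skewLog:.2f}')
```

For Log Normal data the _Log Transform_ yields Normal data by definition, hence its skewness is close to zero, while the _Square Root_ only reduces it.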

In [None]:
# Price Distribution
# Use the Log Transform to make the data more "Normal Like".

hF, hA = plt.subplots(figsize = (10, 6))
sns.histplot(dfDiamonds, x = np.log(dfDiamonds['Price [$]']), kde = True, ax = hA)
# sns.histplot(dfDiamonds, x = 'Price [$]', kde = True, log_scale = True, ax = hA)
# sns.histplot(dfDiamonds, x = sp.stats.boxcox(dfDiamonds['Price [$]'])[0], kde = True, ax = hA)
hA.set_title('Log Price Distribution');

In the case above, the data follows a [bimodal distribution](https://en.wikipedia.org/wiki/Multimodal_distribution).  
It might be useful to cluster data by the modality. For instance, see [Otsu's Method](https://en.wikipedia.org/wiki/Otsu%27s_method). 
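
The idea of Otsu's Method for 1D data is to pick the threshold which maximizes the between class variance of the histogram. The following is a minimal sketch, assuming synthetic bimodal samples (not the diamonds data). The function `OtsuThreshold()` and its `numBins` parameter are illustrative names:

```python
import numpy as np

def OtsuThreshold( vX: np.ndarray, numBins: int = 128 ) -> float:
    # Histogram based Otsu: The threshold which maximizes the between class variance
    vCount, vEdge = np.histogram(vX, bins = numBins)
    vCenter = 0.5 * (vEdge[:-1] + vEdge[1:])
    vP      = vCount / vCount.sum() #<! Probability per bin

    vW0      = np.cumsum(vP)           #<! Weight of the lower class
    vW1      = 1.0 - vW0               #<! Weight of the upper class
    vM       = np.cumsum(vP * vCenter) #<! Cumulative first moment
    valMeanT = vM[-1]                  #<! Global mean

    vValid   = (vW0 > 1e-12) & (vW1 > 1e-12) #<! Both classes must be non empty
    vBetween = np.zeros(numBins)
    vBetween[vValid] = ((valMeanT * vW0[vValid] - vM[vValid]) ** 2) / (vW0[vValid] * vW1[vValid])

    return float(vEdge[np.argmax(vBetween) + 1]) #<! Upper edge of the last bin of the lower class

oRng = np.random.default_rng(0)
# Synthetic bimodal samples (Hypothetical)
vX = np.concatenate([oRng.normal(6.8, 0.4, size = 3_000), oRng.normal(8.8, 0.4, size = 3_000)])

valThr = OtsuThreshold(vX)
vLabel = vX > valThr #<! Cluster assignment by the threshold
print(f'Threshold: {valThr:.2f}, Share of the upper cluster: {np.mean(vLabel):.2f}')
```

The resulting label can serve as an additional feature or as a way to analyze each mode separately.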

In [None]:
# Price Distribution per Cut

# Create SeaBorn Multi Plot
# Display the Histogram per Cut of the Diamond

oSnsGrid = sns.FacetGrid(dfDiamonds, col = 'Cut', hue = 'Cut', legend_out = False)
oSnsGrid.map(sns.histplot, 'Price [$]', stat = 'probability', kde = True, alpha = 0.55)
oSnsGrid.add_legend();

# oSnsGrid = sns.displot(data = dfDiamonds, x = 'Price [$]', hue = 'Cut', col = 'Cut', **{'stat': 'probability', 'kde': True, 'log_scale': True})

When comparing histograms of groups of different sizes, as above, it is important to normalize by the group size. See the `stat` parameter in `sns.histplot()`.
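
The need for normalization can be seen with two groups drawn from the same distribution but with different sizes. The following is a minimal sketch on synthetic data, using NumPy only:

```python
import numpy as np

oRng  = np.random.default_rng(0)
vA    = oRng.normal(0, 1, size = 10_000) #<! Large group
vB    = oRng.normal(0, 1, size = 500)    #<! Small group, same distribution
vEdge = np.linspace(-4, 4, 17)           #<! Shared bins

vCountA, _ = np.histogram(vA, bins = vEdge)
vCountB, _ = np.histogram(vB, bins = vEdge)

# Raw counts scale with the group size while proportions are comparable
vProbA = vCountA / vCountA.sum()
vProbB = vCountB / vCountB.sum()

print(f'Max count     : {vCountA.max()} vs. {vCountB.max()}')
print(f'Max proportion: {vProbA.max():.3f} vs. {vProbB.max():.3f}')
```

The proportions sum to 1 per group, which is the normalization `stat = 'probability'` applies.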

In [None]:
# Price Normalized by Carat per Cut

dfDiamonds['Price per Carat [$]'] = dfDiamonds['Price [$]'] / dfDiamonds['Carat']

oSnsGrid = sns.displot(data = dfDiamonds, x = 'Price per Carat [$]', hue = 'Cut', col = 'Cut', **{'stat': 'probability', 'kde': True, 'log_scale': True})

The normalization by the Carat (Weight), together with the Log Transform, brought the distribution closer to a Normal Distribution.

In [None]:
# Price Distribution per Clarity

hF, hA = plt.subplots(figsize = (10, 6))
sns.violinplot(data = dfDiamonds, x = 'Clarity', y = 'Price per Carat [$]', inner = 'quartile', density_norm = 'area', ax = hA)
hA.set_title('Price per Carat [$] Distribution per Clarity');

In [None]:
# Price Distribution per Color

hF, hA = plt.subplots(figsize = (10, 6))
sns.boxplot(data = dfDiamonds, x = 'Color', y = 'Price per Carat [$]', log_scale = False, ax = hA)
hA.set_title('Price per Carat [$] Distribution per Color');

In [None]:
# Calculate the Price per Dimension
dfPriceDimension = dfDiamonds.melt(id_vars = 'Price [$]', value_vars = ['Length', 'Width', 'Depth'], var_name = 'Dimension Type', value_name = 'Dimension')

In [None]:
# Connection of Price to Length, Width, and Depth

hF, hA = plt.subplots(figsize = (12, 6))
sns.scatterplot(data = dfPriceDimension, x = 'Dimension', y = 'Price [$]', hue = 'Dimension Type', ax = hA, **{'alpha': 0.175})
hA.set_xlim((2, 10))
hA.set_title('Price [$] vs. Dimension');

In [None]:
# Connection of Price to Length, Width, and Depth

oSnsGrid = sns.FacetGrid(dfPriceDimension, col = 'Dimension Type', hue = 'Dimension Type', legend_out = False)
oSnsGrid.map(sns.regplot, 'Dimension', 'Price [$]', fit_reg = True, lowess = True, line_kws = {'color': 'k', 'lw': 5}, scatter_kws = {'alpha': 0.175})
oSnsGrid.set(xlim = (2, 10), ylim  = (0, 10_000))
oSnsGrid.add_legend();

The relation looks convex ("Exponential Like"), which makes sense as the price is closely related to the volume, which is the product of the 3 dimensions.
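
One way to probe the shape of such a relation is a linear fit in the log-log domain, where the slope estimates the power of the dimension. The following is a minimal sketch on synthetic data. The proportionality of the dimensions and of the price to the volume are assumptions of the sketch and not results from the data set:

```python
import numpy as np

oRng    = np.random.default_rng(0)
vLength = oRng.uniform(4, 9, size = 2_000) #<! Synthetic length [Millimeter]
vVolume = vLength * (1.0 * vLength) * (0.62 * vLength) #<! Assumption: Width and depth are proportional to the length
vPrice  = 20 * vVolume * oRng.lognormal(0, 0.1, size = vLength.size) #<! Assumption: Price is proportional to the volume (With noise)

# Linear fit in the log-log domain: log(price) = p * log(length) + c
valP, valC = np.polyfit(np.log(vLength), np.log(vPrice), deg = 1)
print(f'Estimated power: {valP:.2f}') #<! Close to 3 under the assumptions above
```

A slope far from the expected power hints that the price depends on more than the size alone.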

In [None]:
# Calculate the Volume
dfDiamonds['Volume'] = dfDiamonds['Length'] * dfDiamonds['Width'] * dfDiamonds['Depth']

In [None]:
# Subset of the Data (Remove Outliers)

# By Price
quantileLow, quantilehigh = dfDiamonds['Price [$]'].quantile([0.03, 0.97])
dfFiltered = dfDiamonds[dfDiamonds['Price [$]'].between(quantileLow, quantilehigh, inclusive = 'both')]

# By Volume
quantileLow, quantilehigh = dfDiamonds['Volume'].quantile([0.03, 0.97])
dfFiltered = dfFiltered[dfFiltered['Volume'].between(quantileLow, quantilehigh, inclusive = 'both')]

dfFiltered = dfFiltered[['Price [$]', 'Volume']]

In [None]:
# Correlation
dfCorrelation = dfFiltered.corr()
dfCorrelation

In [None]:
# Random Sub Sample (Run Time)

dfFiltered = dfFiltered.sample(n = 10_000, random_state = seedNum)

In [None]:
# Price per Volume

hF, hA = plt.subplots(figsize = (12, 6))
sns.regplot(data = dfFiltered, x = 'Volume', y = 'Price [$]', lowess = True, line_kws = {'color': 'k', 'lw': 5}, scatter_kws = {'alpha': 0.175}, ax = hA)
hA.set_title('Price [$] vs. Volume');

In [None]:
# Subset of the Data (Remove Outliers)

# By Price
quantileLow, quantilehigh = dfDiamonds['Price [$]'].quantile([0.03, 0.97])
dfFiltered = dfDiamonds[dfDiamonds['Price [$]'].between(quantileLow, quantilehigh, inclusive = 'both')]

# By Volume
quantileLow, quantilehigh = dfDiamonds['Volume'].quantile([0.03, 0.97])
dfFiltered = dfFiltered[dfFiltered['Volume'].between(quantileLow, quantilehigh, inclusive = 'both')]

In [None]:
# Information per Property
# Using Group By

dfFiltered.groupby('Cut').agg({'Price [$]': ['mean', 'std', 'median', 'min', 'max'], 'Carat': ['mean', 'std', 'median', 'min', 'max'], 'Volume': ['mean', 'std', 'median', 'min', 'max']})

In [None]:
dfFiltered.groupby('Clarity').agg({'Price [$]': ['mean', 'std', 'median', 'min', 'max'], 'Carat': ['mean', 'std', 'median', 'min', 'max'], 'Volume': ['mean', 'std', 'median', 'min', 'max']})

## EDA - The JSE OK Cupid Data Set

The data set accompanies the _Journal of Statistics Education_ paper on using OkCupid data in data science courses.

In [None]:
# Download the Data

dataFileUrl  = 'https://technionmail-my.sharepoint.com/:x:/g/personal/royia_technion_ac_il/ESJTvS3m9-ZFnnxqFE2ah4QBWbE2Sn9cQHDnMhg3ntnvhg?e=XxhVsu'
dataFileName = 'JSEOKCupidProfileData.csv'

if not os.path.exists(dataFileName):
    onedrivedownloader.download(dataFileUrl, dataFileName)

In [None]:
# Load Data
dfOkCupid = pd.read_csv('JSEOKCupidProfileData.csv') 
dfOkCupid

In [None]:
dColName = {
    'age': 'Age', 
    'body_type': 'BodyType', 
    'diet': 'Diet', 
    'drinks': 'Drinks', 
    'drugs': 'Drugs', 
    'education': 'Education', 
    'ethnicity': 'Ethnicity', 
    'height': 'Height', 
    'income': 'Income', 
    'job': 'Job',
    'offspring': 'Offspring',
    'orientation': 'Orientation',
    'pets': 'Pets',
    'religion': 'Religion',
    'sex': 'Sex', 
    'sign': 'Sign',
    'smokes': 'Smokes',
    'speaks': 'Speaks',
    'status': 'Status',
}

dfOkCupid.rename(columns = dColName, inplace = True)

print(f'The columns are given by: {dfOkCupid.columns}')

In [None]:
# Convert Units

# Convert the Height from Inches to Centimeters
dfOkCupid['Height'] = dfOkCupid['Height'].apply(lambda x: round(x * 2.54, ndigits = 2) if isinstance(x, (int, float)) else x)

In [None]:
# Remove Invalid Height Values

dfOkCupid = dfOkCupid.dropna(axis = 0, subset = ['Height'])
dfOkCupid = dfOkCupid[dfOkCupid['Height'].between(120, 220, inclusive = 'both')]

In [None]:
# Precompute helpers
lSign = ['Aries', 'Taurus', 'Gemini', 'Cancer', 'Leo', 'Virgo', 'Libra', 'Scorpio', 'Sagittarius', 'Capricorn', 'Aquarius', 'Pisces']

lSignLower = [s.lower() for s in lSign]
dSign      = {s.lower(): s for s in lSign} #<! Mapping of lower case sign to canonical name
# Whole Word regex: prevents "libraries" -> "Aries"
oWordRe = re.compile(r'\b(' + '|'.join(lSignLower) + r')\b', flags = re.IGNORECASE)

def GuessSign( inStr: object, *, valThr: float = 0.80) -> object:
    """
    Return the most probable zodiac sign from the free-text `inStr` string.
    If no confident match is found, return NaN.
    """
    
    if pd.isna(inStr):
        return np.nan

    inStrLower = str(inStr).lower()

    # 1) Exact whole-word match (fast, precise)
    m = oWordRe.search(inStrLower)
    if m:
        return dSign[m.group(1).lower()]

    # 2) Fuzzy fallback: compare each token to each sign; take best score
    lToken = re.findall(r"[a-z]+", inStrLower)  # simple tokenization
    if not lToken:
        return np.nan

    bestSign = None
    bestScore = 0.0
    for signCanon, signLower in zip(lSign, lSignLower):
        # best token to sign similarity
        for tok in lToken:
            score = difflib.SequenceMatcher(None, tok, signLower).ratio()
            if score > bestScore:
                bestScore = score
                bestSign = signCanon

    return bestSign if bestScore >= valThr else np.nan
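
The fuzzy fallback relies on the similarity ratio of `difflib.SequenceMatcher`, which is in the range [0, 1]. The following is a minimal sketch of the score for a few hypothetical tokens, compared against the default threshold of `0.80`:

```python
import difflib

# Hypothetical (token, sign) pairs: A typo, an exact match and an unrelated word
lPair  = [('capricon', 'capricorn'), ('leo', 'leo'), ('dog', 'aries')]
lScore = []

for tok, signLower in lPair:
    # The ratio is 2 * (Number of matching characters) / (Total length of both strings)
    score = difflib.SequenceMatcher(None, tok, signLower).ratio()
    lScore.append(score)
    print(f'{tok} vs. {signLower}: {score:.2f}, Accepted: {score >= 0.80}')
```

A higher threshold rejects more typos while a lower one might match unrelated words, hence the threshold is worth validating on a sample of the raw values.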

In [None]:
# Clean the Sign Column

dfOkCupid['Sign'] = dfOkCupid['Sign'].apply(GuessSign)
dfOkCupid = dfOkCupid.dropna(axis = 0, subset = ['Sign'])
dfOkCupid['Sign'].unique() #<! Show the unique values in the Sign column

In [None]:
# Body Type Mapping
dBodyType = {
    'a little extra': 'Curvy',
    'average': 'Average',
    'thin': 'Thin',
    'fit': 'Fit',
    'athletic': 'Fit',
    'curvy': 'Curvy',
    'skinny': 'Thin',
    'full figured': 'Curvy',
    'jacked': 'Fit',
    'overweight': 'Overweight',
    'used up': 'Average',
    'rather not say': None,
}

dfOkCupid['BodyType'] = dfOkCupid['BodyType'].map(dBodyType)
dfOkCupid = dfOkCupid.dropna(axis = 0, subset = ['BodyType'])
dfOkCupid['BodyType'].unique() #<! Show the unique values in the BodyType column

In [None]:
dSex = {
    'm': 'Male',
    'f': 'Female',
}
dfOkCupid['Sex'] = dfOkCupid['Sex'].map(dSex)
dfOkCupid['Sex'].unique() #<! Show the unique values

In [None]:
dOrientation = {
    'straight': 'Straight',
    'gay': 'Gay',
    'bisexual': 'Bisexual',
}

dfOkCupid['Orientation'] = dfOkCupid['Orientation'].map(dOrientation)
dfOkCupid['Orientation'].unique() #<! Show the unique values in the Orientation column

In [None]:
# The Sex and Orientation
dfCrossTab = pd.crosstab(dfOkCupid['Sex'], dfOkCupid['Orientation'])
dfCrossTab

In [None]:
# Long Form Table
dfCrossTab = dfCrossTab.unstack().reset_index().rename(columns = {0: 'Count'})
dfCrossTab

In [None]:
# The Sex and Orientation
# Stacked Bar Plot (PlotLy)

hF = px.bar(dfCrossTab, x = 'Sex', y = 'Count', color = 'Orientation')
hF.show()

In [None]:
# Normalize

dsTotal = dfCrossTab.groupby('Sex')['Count'].transform('sum')
dfCrossTab['Relative'] = (dfCrossTab['Count'] / dsTotal)
dfCrossTab

* <font color='brown'>(**#**)</font> When working with an `LLM` you may find `print(df.to_markdown())` useful.
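
The per group normalization can also be obtained directly from `pd.crosstab()` using its `normalize` parameter. The following is a minimal sketch on a small hypothetical sample (not the OkCupid data):

```python
import pandas as pd

# Hypothetical small sample
dfSample = pd.DataFrame({
    'Sex':         ['Male', 'Male', 'Male', 'Female', 'Female', 'Male', 'Female', 'Male'],
    'Orientation': ['Straight', 'Gay', 'Straight', 'Straight', 'Bisexual', 'Straight', 'Straight', 'Bisexual'],
})

# `normalize = 'index'` divides each row by its sum (Fraction per `Sex`)
dfRelative = pd.crosstab(dfSample['Sex'], dfSample['Orientation'], normalize = 'index')
print(dfRelative)
```

The result is in wide form, so it should be converted to long form (as done for the counts) before being passed to `px.bar()`.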

In [None]:
hF = px.bar(dfCrossTab, x = 'Sex', y = 'Relative', color = 'Orientation')
hF.show()

In [None]:
# Add Jitter to Sex for Plotting
dSex = {
    'Male': 0,
    'Female': 1
}
dfOkCupid['SexJitter'] = dfOkCupid['Sex'].map(dSex) + np.random.uniform(-0.2, 0.2, size = dfOkCupid.shape[0])

In [None]:
# Logistic Regression
oLogReg = LogisticRegression(solver = 'liblinear', random_state = seedNum)
oLogReg.fit(dfOkCupid[['Height']].to_numpy(), dfOkCupid['Sex'].map(dSex))

vHeight  = np.linspace(dfOkCupid['Height'].min(), dfOkCupid['Height'].max(), num = 1_000)
vSexProb = oLogReg.predict_proba(vHeight.reshape(-1, 1))[:, 1] #<! Probability for 1

In [None]:
# Logistic Regression for Sex by Height
hFig = px.scatter(dfOkCupid, x = 'Height', y = 'SexJitter', color = 'Sex', subtitle = 'Sex vs. Height', width = 850, height = 500)
hFig.add_trace(go.Scatter(x = vHeight, y = vSexProb, mode = 'lines', name = 'Logistic Regression', line = dict(color = 'black', width = 2)))
hFig.update_layout(
    yaxis = dict(
        tickmode = 'array',
        tickvals = [0, 1],  #<! Ticks
        ticktext = ['Male', 'Female'], #<! Tick labels
        title = 'Sex',
    )
)
hFig.show()

### Measures of Distribution

Many measures are related to the moments of the data.

#### Skewness

Calculated by the 3rd moment:

$$ \operatorname{skew} \left( X \right) = \mathbb{E} \left[ {\left( \frac{ X - \mu }{\sigma} \right)}^{3} \right] = \frac{ \mathbb{E} \left[ {X}^{3} \right] - 3 \mu {\sigma}^{2} - {\mu}^{3} }{ {\sigma}^{3} } $$

Measures how asymmetric the distribution is.

![](https://i.imgur.com/bKBT0sP.png)
<!-- ![](https://i.postimg.cc/MK1mGYL6/Diagrams-Skewness.png) -->

* <font color='brown'>(**#**)</font> For alternative measures of Skewness see [Wikipedia - Other Measures of Skewness](https://en.wikipedia.org/wiki/Skewness#Other_measures_of_skewness).
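
The equivalence of the two forms of the skewness can be verified numerically. The following is a minimal sketch on synthetic Gamma samples, using the population form of the moments (`ddof = 0`):

```python
import numpy as np

oRng = np.random.default_rng(0)
vX   = oRng.gamma(shape = 2.0, scale = 1.0, size = 100_000) #<! Synthetic right skewed samples

valMu    = np.mean(vX)
valSigma = np.std(vX) #<! Population form (ddof = 0)

skewStd = np.mean(((vX - valMu) / valSigma) ** 3) #<! Standardized 3rd moment
skewRaw = (np.mean(vX ** 3) - 3 * valMu * valSigma ** 2 - valMu ** 3) / valSigma ** 3 #<! Raw moments form

print(f'Standardized form: {skewStd:.4f}, Raw moments form: {skewRaw:.4f}')
```

Both forms match up to numerical precision, and the positive value reflects the long right tail of the samples.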

#### Kurtosis

Calculated by the 4th moment.

$$ \operatorname{kurt} \left( X \right) = \mathbb{E} \left[ {\left( \frac{ X - \mu }{\sigma} \right)}^{4} \right] = \frac{ {\mu}_{4} }{ {\sigma}^{4} } $$

Reflects either the presence of existing outliers (Sample kurtosis) or the tendency to produce outliers (Kurtosis of a probability distribution).  
It is usually compared to the Kurtosis of the Normal Distribution, which equals 3. The difference is called the _Excess Kurtosis_:

$$ \tilde{\operatorname{kurt}} \left( X \right) = \operatorname{kurt} \left( X \right) - 3 $$
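
The _Excess Kurtosis_ can be estimated from samples by the 4th standardized moment. The following is a minimal sketch on synthetic samples, where `ExcessKurtosis()` is a helper defined here for illustration. Note that `scipy.stats.kurtosis()` also returns the excess form by default:

```python
import numpy as np

def ExcessKurtosis( vX: np.ndarray ) -> float:
    # The 4th standardized moment minus the Kurtosis of the Normal Distribution (3)
    vZ = (vX - np.mean(vX)) / np.std(vX)
    return float(np.mean(vZ ** 4) - 3.0)

oRng       = np.random.default_rng(0)
numSamples = 200_000

dSample = {
    'Normal':  oRng.normal(0, 1, size = numSamples),   #<! Reference, around 0
    'Uniform': oRng.uniform(-1, 1, size = numSamples), #<! Light tails, negative
    'Laplace': oRng.laplace(0, 1, size = numSamples),  #<! Heavy tails, positive
}

dKurt = {distName: ExcessKurtosis(vS) for distName, vS in dSample.items()}
for distName, valKurt in dKurt.items():
    print(f'{distName}: {valKurt:.2f}')
```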


#### Transformations

There are some transformations to make the distribution closer to _Normal_:

 - [Power Transformation (Box Cox Transformation)](https://en.wikipedia.org/wiki/Power_transform).
 - Using $\sqrt{\cdot}$ and $\log \left( \cdot \right)$ (See [Variance Stabilizing Transformation](https://en.wikipedia.org/wiki/Variance-stabilizing_transformation)).
 - See [Advanced Normalizing Distribution Transformations](https://stats.stackexchange.com/questions/1601).
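
The Box Cox transformation unifies the power and the log transformations by a single parameter $\lambda$. The following is a minimal sketch of the one parameter form for strictly positive data, where `BoxCox()` and `paramLambda` are illustrative names:

```python
import numpy as np

def BoxCox( vX: np.ndarray, paramLambda: float ) -> np.ndarray:
    # One parameter Box Cox transform for strictly positive data
    if paramLambda == 0:
        return np.log(vX) #<! The limit case
    return (vX ** paramLambda - 1.0) / paramLambda

vX = np.array([0.5, 1.0, 2.0, 4.0, 8.0]) #<! Hypothetical positive samples

vY1   = BoxCox(vX, 1.0)  #<! Shift only: x - 1
vYSqr = BoxCox(vX, 0.5)  #<! Square Root up to an affine map: 2 * (sqrt(x) - 1)
vYLim = BoxCox(vX, 1e-8) #<! Approaches log(x) as lambda goes to 0

print(np.round(vYLim, 4))
print(np.round(np.log(vX), 4))
```

In practice the parameter is estimated from the data, for instance by `scipy.stats.boxcox()`.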


</br>


* <font color='brown'>(**#**)</font> Kurtosis and Skewness can be used to check for the modality of the distribution. See [Measure Modality of a Distribution](https://stats.stackexchange.com/questions/395908).

In [None]:
# Generate Samples to Show the Different Distributions
numSamples = 500

dDist = {
    'Normal': lambda: np.random.normal(0, 1, size = numSamples),
    'Log Normal': lambda: np.random.lognormal(0, 1, size = numSamples),
    'Beta (20, 2)': lambda: np.random.beta(20, 2, size = numSamples),
    'Laplace': lambda: np.random.laplace(0, 1, size = numSamples),
    'Student T (3)': lambda: np.random.standard_t(3, size = numSamples),
    'Uniform': lambda: np.random.uniform(-2, 2, size = numSamples),
    'Bimodal Symmetric': lambda: np.concatenate([np.random.normal(-2, 0.6, size = numSamples // 2), np.random.normal(2, 0.6, size = numSamples - numSamples // 2)]),
    'Bimodal Skewed': lambda: np.concatenate([np.random.normal(0, 1, size = int(numSamples * 0.8)), np.random.normal(4, 0.7, size = numSamples - int(numSamples * 0.8))]),
    'Trimodal': lambda: np.concatenate([np.random.normal(-3, 0.5, size = numSamples // 3), np.random.normal(0, 0.5, size = numSamples // 3), np.random.normal(3, 0.5, size = numSamples - 2 * (numSamples // 3))]),
}

hF, vHa = plt.subplots(nrows = 3, ncols = 3, figsize = (16, 11))
vHa = vHa.flat

for ii, (distName, distFunc) in enumerate(dDist.items()):
    hA = vHa[ii]
    vS = distFunc()
    valMean = np.mean(vS)
    valStd  = np.std(vS)
    valMedian = np.median(vS)
    valSkew = sp.stats.skew(vS)
    valKurt = sp.stats.kurtosis(vS)
    sns.histplot(vS, kde = True, stat = 'density', ax = hA)
    hA.set_title(f'{distName} Distribution')

    hA.axvline(valMean, color = 'red', linestyle = '--', label = f'Mean: {valMean:.2f}')
    hA.axvline(valMedian, color = 'green', linestyle = '-.', label = f'Median: {valMedian:.2f}')
    hA.axvline(valMean - valStd, color = 'orange', linestyle = ':', label = f'Std: {valStd:.2f}')
    hA.axvline(valMean + valStd, color = 'orange', linestyle = ':')
    # Text of the Skewness and Kurtosis (SciPy's `kurtosis()` returns the Excess Kurtosis by default)
    hA.text(0.30, 0.15, f'Skew: {valSkew:.2f}\nKurtosis: {valKurt:.2f}', horizontalalignment = 'right', verticalalignment = 'top', transform = hA.transAxes, bbox = dict(facecolor = 'white', alpha = 0.5))
    hA.legend()

In [None]:
# Skewness Analysis
vX = np.linspace(0, 1, 1_000)

tuData = (
    (2, 5),
    (5, 5),
    (5, 2),
)


hF, vHa = plt.subplots(nrows = 1, ncols = 3, figsize = (14, 5))
vHa = vHa.flat

for ii, (a, b) in enumerate(tuData):
    oDistBeta = sp.stats.beta(a = a, b = b)
    hA = vHa[ii]
    hA.plot(vX, oDistBeta.pdf(vX), color = 'blue', lw = 2)
    # Remove the top and right spines
    hA.spines['top'].set_visible(False)
    hA.spines['right'].set_visible(False)   
    # Remove Ticks and Labels
    hA.set_axis_off()

# hF.savefig('TMP.svg', transparent = True, dpi = 100, bbox_inches = None)

In [None]:
# Kurtosis Analysis

vX = np.linspace(-5, 5, 1_000)

dDist = {
    'Normal (0, 1)': sp.stats.norm(0, 1),
    'T Student (3)': sp.stats.t(3),
    'Laplace (0, 1)': sp.stats.laplace(0, 1),
    'Beta (2, 2)': sp.stats.beta(2, 2, loc = -0.5),
}

hF, hA = plt.subplots(figsize = (6, 4))

for ii, (distName, oDist) in enumerate(dDist.items()):
    valKurt = oDist.stats(moments = 'k')
    hA.plot(vX, oDist.pdf(vX), lw = 2, label = distName + ', Kurtosis: ' + f'{valKurt:.2f}')


hA.set_xticks([])
hA.set_yticks([])


hA.legend(loc = 'upper left', fontsize = 8)

# hF.savefig('TMP.svg', transparent = True, dpi = 100, bbox_inches = None)

In [None]:
# Generate Samples to Show the Different Distributions
numSamples = 5_000


vS1  = np.random.beta(2, 3, size = numSamples)
vS2  = np.random.lognormal(0.5, 0.5, size = numSamples)
vS1T = np.sqrt(vS1)
vS2T = np.log(vS2)

hF, hA = plt.subplots(figsize = (6, 4))
sns.histplot(vS1, kde = True, stat = 'density', ax = hA)
hA.set_xticks([])
hA.set_yticks([])
hA.set_ylabel(None)
# hF.savefig('TMP.svg', transparent = True, dpi = 100, bbox_inches = None)

hF, hA = plt.subplots(figsize = (6, 4))
sns.histplot(vS1T, kde = True, stat = 'density', ax = hA)
hA.set_xticks([])
hA.set_yticks([])
hA.set_ylabel(None)
# hF.savefig('TMP.svg', transparent = True, dpi = 100, bbox_inches = None)

hF, hA = plt.subplots(figsize = (6, 4))
sns.histplot(vS2, kde = True, stat = 'density', ax = hA)
hA.set_xticks([])
hA.set_yticks([])
hA.set_ylabel(None)
# hF.savefig('TMP.svg', transparent = True, dpi = 100, bbox_inches = None)

hF, hA = plt.subplots(figsize = (6, 4))
sns.histplot(vS2T, kde = True, stat = 'density', ax = hA)
hA.set_xticks([])
hA.set_yticks([])
hA.set_ylabel(None)
# hF.savefig('TMP.svg', transparent = True, dpi = 100, bbox_inches = None)