**Table of Contents**

>[Description](#scrollTo=tS1kJQHrUEX0)

>[Import Libraries](#scrollTo=YbwVTfLLDRxt)

>[Create Dataset](#scrollTo=s_VPiogIDOoL)

>[Marginal Distribution](#scrollTo=QlGGA0DGPJqj)

>>[Compute Marginal numbers](#scrollTo=Zan6XXWOgkcB)

>>[Compute joint distribution](#scrollTo=txNz9jAigsT6)

>>[Extract some useful information](#scrollTo=meUF0UqwHG6A)

>>[AND](#scrollTo=VuWEHhDkqKr1)

>>[OR](#scrollTo=KncuMPLnqPPm)

>>[Conditional Probability](#scrollTo=s_2rRTfLqTWN)

>>[Independence and dependence](#scrollTo=zArswkqWtQfP)



# Description

**In this notebook, I first create a small handmade (not real) dataset about 3 types of disease that have spread across all continents. My goal is to show how easily useful information can be read off MARGINAL DISTRIBUTIONS. I also show how convenient it is to work with handmade datasets using the Pandas library.**

# Import Libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Create Dataset

First, I create the 3 main parts of a data frame (table): the column labels, the row labels, and the data.

In [2]:
columns = ['North America', 'South America', 'Asia', 'Australia', 'Europe', 'Africa']
rows = ['Type A', 'Type B', 'Type C']
data = [
          [150, 120, 540, 15, 180, 20],
          [100, 10, 50, 120, 130, 70],
          [90, 200, 750, 140, 10, 95]
      ]
dataset = pd.DataFrame(data=data, index=rows, columns=columns)

In [3]:
dataset

| | North America | South America | Asia | Australia | Europe | Africa |
|---|---|---|---|---|---|---|
| Type A | 150 | 120 | 540 | 15 | 180 | 20 |
| Type B | 100 | 10 | 50 | 120 | 130 | 70 |
| Type C | 90 | 200 | 750 | 140 | 10 | 95 |


This is our raw dataframe.

# Marginal Distribution

## Compute Marginal numbers

For MARGINAL DISTRIBUTIONS we need the total of each row and each column, so let's compute them.

In [4]:
dataset['TOTAL'] = dataset.sum(axis=1)

In [5]:
dataset

| | North America | South America | Asia | Australia | Europe | Africa | TOTAL |
|---|---|---|---|---|---|---|---|
| Type A | 150 | 120 | 540 | 15 | 180 | 20 | 1025 |
| Type B | 100 | 10 | 50 | 120 | 130 | 70 | 480 |
| Type C | 90 | 200 | 750 | 140 | 10 | 95 | 1285 |


In [6]:
dataset.loc['TOTAL'] = dataset.sum(axis=0)

In [7]:
dataset

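| | North America | South America | Asia | Australia | Europe | Africa | TOTAL |
|---|---|---|---|---|---|---|---|
| Type A | 150 | 120 | 540 | 15 | 180 | 20 | 1025 |
| Type B | 100 | 10 | 50 | 120 | 130 | 70 | 480 |
| Type C | 90 | 200 | 750 | 140 | 10 | 95 | 1285 |
| TOTAL | 340 | 330 | 1340 | 275 | 320 | 185 | 2790 |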
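
As a side check, the marginal totals can also be computed without adding TOTAL entries to the table. The sketch below rebuilds the same counts under a hypothetical name `counts` (not used elsewhere in this notebook) and confirms that the row totals and the column totals add up to the same grand total.

```python
import pandas as pd

columns = ['North America', 'South America', 'Asia', 'Australia', 'Europe', 'Africa']
rows = ['Type A', 'Type B', 'Type C']
data = [
    [150, 120, 540, 15, 180, 20],
    [100, 10, 50, 120, 130, 70],
    [90, 200, 750, 140, 10, 95],
]
# `counts` is an illustrative name for the raw table, before any TOTAL is added.
counts = pd.DataFrame(data=data, index=rows, columns=columns)

row_totals = counts.sum(axis=1)        # cases per disease type
col_totals = counts.sum(axis=0)        # cases per continent
grand_total = counts.to_numpy().sum()  # all cases in the table

# Both sets of marginal totals must add up to the same grand total.
assert row_totals.sum() == col_totals.sum() == grand_total
print(grand_total)  # 2790
```

Keeping the totals in separate Series leaves the raw table untouched, so later sums cannot accidentally include a TOTAL row or column.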


## Compute joint distribution

In [8]:
def color_map(val):
  # Colour a cell green when its value equals one of the marginal totals.
  # Matching is by value, so an inner cell equal to a total is coloured too.
  row_totals = dataset['TOTAL'].to_numpy()
  col_totals = dataset.loc['TOTAL'].to_numpy()
  if val in row_totals or val in col_totals:
    return 'color: green'

In [9]:
styled_data = dataset.style.applymap(color_map)

In [10]:
styled_data

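| | North America | South America | Asia | Australia | Europe | Africa | TOTAL |
|---|---|---|---|---|---|---|---|
| Type A | 150 | 120 | 540 | 15 | 180 | 20 | 1025 |
| Type B | 100 | 10 | 50 | 120 | 130 | 70 | 480 |
| Type C | 90 | 200 | 750 | 140 | 10 | 95 | 1285 |
| TOTAL | 340 | 330 | 1340 | 275 | 320 | 185 | 2790 |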


The data frame above contains the marginal totals, that is, the sum of each row and each column. What we need, however, is the probability of each cell, which is called the joint distribution. To get it, we divide each cell value by the grand total (2790).

In [11]:
joint_dist = round(dataset / 2790, 3)

In [12]:
joint_dist

| | North America | South America | Asia | Australia | Europe | Africa | TOTAL |
|---|---|---|---|---|---|---|---|
| Type A | 0.054 | 0.043 | 0.194 | 0.005 | 0.065 | 0.007 | 0.367 |
| Type B | 0.036 | 0.004 | 0.018 | 0.043 | 0.047 | 0.025 | 0.172 |
| Type C | 0.032 | 0.072 | 0.269 | 0.05 | 0.004 | 0.034 | 0.461 |
| TOTAL | 0.122 | 0.118 | 0.48 | 0.099 | 0.115 | 0.066 | 1.0 |


Now we have a table of joint probabilities, with the marginal distributions in the TOTAL row and the TOTAL column. With this data, we can extract some important information.
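
A possible variation, assuming the raw counts are rebuilt under the hypothetical name `counts`: divide by the computed grand total instead of the hard-coded 2790, and keep the values unrounded so that both marginals sum to 1 up to floating-point error. Rounding is then applied only when displaying.

```python
import numpy as np
import pandas as pd

columns = ['North America', 'South America', 'Asia', 'Australia', 'Europe', 'Africa']
rows = ['Type A', 'Type B', 'Type C']
data = [
    [150, 120, 540, 15, 180, 20],
    [100, 10, 50, 120, 130, 70],
    [90, 200, 750, 140, 10, 95],
]
# `counts` is an illustrative name for the raw table without TOTAL entries.
counts = pd.DataFrame(data=data, index=rows, columns=columns)

# Joint distribution: every cell divided by the grand total.
joint = counts / counts.to_numpy().sum()

# Marginal distributions: sum the joint probabilities along each axis.
p_disease = joint.sum(axis=1)    # P(disease type)
p_continent = joint.sum(axis=0)  # P(continent)

# A valid distribution sums to 1 (up to floating-point error).
assert np.isclose(p_disease.sum(), 1.0)
assert np.isclose(p_continent.sum(), 1.0)

print(p_disease.round(3))
print(p_continent.round(3))
```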

## Extract some useful information

**1. For example, we can easily see that 48% of all people in this table are from Asia, because the TOTAL entry of the Asia column is 0.48 out of 1.00**

In [13]:
asian = joint_dist.loc['TOTAL', 'Asia']
print(f'percent of Asian: {asian*100}%')

percent of Asian: 48.0%


In [42]:
def color_map(val):
  if val == joint_dist.loc['TOTAL', 'Asia']:
    color = 'green'
    return f'color: {color}'

In [43]:
joint_dist.style.applymap(color_map)

| | North America | South America | Asia | Australia | Europe | Africa | TOTAL |
|---|---|---|---|---|---|---|---|
| Type A | 0.054 | 0.043 | 0.194 | 0.005 | 0.065 | 0.007 | 0.367 |
| Type B | 0.036 | 0.004 | 0.018 | 0.043 | 0.047 | 0.025 | 0.172 |
| Type C | 0.032 | 0.072 | 0.269 | 0.05 | 0.004 | 0.034 | 0.461 |
| TOTAL | 0.122 | 0.118 | 0.48 | 0.099 | 0.115 | 0.066 | 1.0 |


**2. Almost 37% of people in this table have Type A disease**

In [14]:
typeA = joint_dist.loc['Type A', 'TOTAL']
print(f'percent of Type A: {typeA*100}%')

percent of Type A: 36.7%


In [45]:
def color_map(val):
  if val == joint_dist.loc['Type A', 'TOTAL']:
    color = 'green'
    return f'color: {color}'

In [46]:
joint_dist.style.applymap(color_map)

| | North America | South America | Asia | Australia | Europe | Africa | TOTAL |
|---|---|---|---|---|---|---|---|
| Type A | 0.054 | 0.043 | 0.194 | 0.005 | 0.065 | 0.007 | 0.367 |
| Type B | 0.036 | 0.004 | 0.018 | 0.043 | 0.047 | 0.025 | 0.172 |
| Type C | 0.032 | 0.072 | 0.269 | 0.05 | 0.004 | 0.034 | 0.461 |
| TOTAL | 0.122 | 0.118 | 0.48 | 0.099 | 0.115 | 0.066 | 1.0 |


## AND

**Question: What does each number in a cell show?**

**Answer: each number in an inner cell is a joint probability**

For instance, the percentage of people who are from NORTH AMERICA and have TYPE B disease is almost 3.6%

(people who are from North America **AND** have Type B)

In [40]:
northAmerica = joint_dist['North America']
northAmerica_TypeB = northAmerica.loc['Type B']
print(f'North America and Type B: {round(northAmerica_TypeB*100, 2)}%')

North America and Type B: 3.6%


In [47]:
def color_map(val):
  if val == joint_dist.loc['Type B', 'North America']:
    color = 'green'
    return f'color: {color}'

In [48]:
joint_dist.style.applymap(color_map)

| | North America | South America | Asia | Australia | Europe | Africa | TOTAL |
|---|---|---|---|---|---|---|---|
| Type A | 0.054 | 0.043 | 0.194 | 0.005 | 0.065 | 0.007 | 0.367 |
| Type B | 0.036 | 0.004 | 0.018 | 0.043 | 0.047 | 0.025 | 0.172 |
| Type C | 0.032 | 0.072 | 0.269 | 0.05 | 0.004 | 0.034 | 0.461 |
| TOTAL | 0.122 | 0.118 | 0.48 | 0.099 | 0.115 | 0.066 | 1.0 |


## OR

**If we take a whole row and a whole column and add up every cell in that row and column, counting the shared cell only once, we get the OR operation.**

For example, if we want to know the percentage of people who are from Australia **OR** have Type C, we sum the cells that belong to this column and this row.

In [21]:
australia = joint_dist['Australia'].drop('TOTAL').sum()
typeC = joint_dist.loc['Type C'].drop('TOTAL').sum() - joint_dist.loc['Type C', 'Australia']
print(f'TypeC OR Australia: {(typeC + australia)*100}%')

TypeC OR Australia: 50.9%


Another way is to add the marginal probabilities of Australia and Type C and then subtract their intersection, because we are counting it twice:

P(A union B) = P(A) + P(B) - P(A intersection B)



In [59]:
Aus_union_typeC = (joint_dist.loc['TOTAL', 'Australia'] + joint_dist.loc['Type C', 'TOTAL']) - joint_dist.loc['Type C', 'Australia']
print(f'TypeC OR Australia: {round(Aus_union_typeC*100, 1)}%')

TypeC OR Australia: 51.0%


In [60]:
def color_map(val):
  # Colour the cells of the Australia column and the Type C row (TOTAL excluded).
  column = joint_dist['Australia'].drop('TOTAL').to_numpy()
  row = joint_dist.loc['Type C'].drop('TOTAL').to_numpy()
  if val in column or val in row:
    return 'color: green'

In [61]:
joint_dist.style.applymap(color_map)

| | North America | South America | Asia | Australia | Europe | Africa | TOTAL |
|---|---|---|---|---|---|---|---|
| Type A | 0.054 | 0.043 | 0.194 | 0.005 | 0.065 | 0.007 | 0.367 |
| Type B | 0.036 | 0.004 | 0.018 | 0.043 | 0.047 | 0.025 | 0.172 |
| Type C | 0.032 | 0.072 | 0.269 | 0.05 | 0.004 | 0.034 | 0.461 |
| TOTAL | 0.122 | 0.118 | 0.48 | 0.099 | 0.115 | 0.066 | 1.0 |


**Both ways give the same result up to rounding (50.9% vs. 51.0%), because every entry of the table was rounded to 3 decimals**
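
The inclusion-exclusion rule can also be checked on the raw counts, which avoids rounding altogether. This is a minimal sketch; `counts`, `p_either`, and `p_direct` are illustrative names, not part of the notebook above.

```python
import numpy as np
import pandas as pd

columns = ['North America', 'South America', 'Asia', 'Australia', 'Europe', 'Africa']
rows = ['Type A', 'Type B', 'Type C']
data = [
    [150, 120, 540, 15, 180, 20],
    [100, 10, 50, 120, 130, 70],
    [90, 200, 750, 140, 10, 95],
]
# Raw counts without TOTAL entries (illustrative name).
counts = pd.DataFrame(data=data, index=rows, columns=columns)
total = counts.to_numpy().sum()

p_australia = counts['Australia'].sum() / total
p_type_c = counts.loc['Type C'].sum() / total
p_both = counts.loc['Type C', 'Australia'] / total

# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B)
p_either = p_australia + p_type_c - p_both

# Direct count: the whole Type C row plus the rest of the Australia column.
p_direct = (counts.loc['Type C'].sum()
            + counts['Australia'].drop('Type C').sum()) / total

assert np.isclose(p_either, p_direct)
print(f'Type C OR Australia: {p_either:.3f}')  # 0.509
```

On unrounded values the two routes agree exactly, which suggests that any gap seen on a rounded table comes from the rounding and not from the formula.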

## Conditional Probability

Using the marginal distribution, we can easily compute a CONDITIONAL PROBABILITY

For instance: if we know a patient has Type B disease, what is the probability that the patient is European?

P(Europe | TypeB) = P(Europe intersection TypeB) / P(TypeB)

In [94]:
eu_typeB = joint_dist.loc['Type B', 'Europe']
typeB = joint_dist.loc['Type B', 'TOTAL']
print(f'P(Europe | TypeB): {round(eu_typeB / typeB, 2)}')

P(Europe | TypeB): 0.27
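
The same division can be applied to a whole row at once, which gives the full conditional distribution P(Continent | Type B) rather than a single value. The sketch below works on the raw counts under the hypothetical name `counts`; dividing counts by the row total is equivalent to dividing joint probabilities by the marginal, since the grand total cancels.

```python
import numpy as np
import pandas as pd

columns = ['North America', 'South America', 'Asia', 'Australia', 'Europe', 'Africa']
rows = ['Type A', 'Type B', 'Type C']
data = [
    [150, 120, 540, 15, 180, 20],
    [100, 10, 50, 120, 130, 70],
    [90, 200, 750, 140, 10, 95],
]
# Raw counts without TOTAL entries (illustrative name).
counts = pd.DataFrame(data=data, index=rows, columns=columns)

# P(Continent | Type B): the Type B row divided by its own total.
type_b = counts.loc['Type B']
p_continent_given_b = type_b / type_b.sum()

# Conditioning every row on its disease type in one step.
p_continent_given_disease = counts.div(counts.sum(axis=1), axis=0)

# Each conditional distribution sums to 1.
assert np.isclose(p_continent_given_b.sum(), 1.0)
assert np.allclose(p_continent_given_disease.sum(axis=1), 1.0)

print(p_continent_given_b.round(3))
```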


In [97]:
def color_map(val):
  # Within the Type B row: the Europe cell is blue, the other continents green.
  if val in joint_dist.loc['Type B'].drop('TOTAL').to_numpy():
    color = 'blue' if val == joint_dist.loc['Type B', 'Europe'] else 'green'
    return f'color: {color}'

In [98]:
joint_dist.style.applymap(color_map, subset=pd.IndexSlice['Type B', :])

| | North America | South America | Asia | Australia | Europe | Africa | TOTAL |
|---|---|---|---|---|---|---|---|
| Type A | 0.054 | 0.043 | 0.194 | 0.005 | 0.065 | 0.007 | 0.367 |
| Type B | 0.036 | 0.004 | 0.018 | 0.043 | 0.047 | 0.025 | 0.172 |
| Type C | 0.032 | 0.072 | 0.269 | 0.05 | 0.004 | 0.034 | 0.461 |
| TOTAL | 0.122 | 0.118 | 0.48 | 0.099 | 0.115 | 0.066 | 1.0 |


## Independence and dependence

Using the marginal distributions, we can also check dependence and independence.

Two events A and B are independent if:

1. P(A intersection B) = P(A) * P(B)

or

2. P(A | B) = P(A)

In the two examples below, we check whether the disease type is independent of the continent.

We use the `P(A intersection B) = P(A) * P(B)` formula, but you can also use `P(A | B) = P(A)`.

In [109]:
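typeC_intersect_SouthAmerica = joint_dist.loc['Type C', 'South America']
typeC = joint_dist.loc['Type C', 'TOTAL']
SouthAmerica = joint_dist.loc['TOTAL', 'South America']

# Compare at the table's precision (3 decimals); round() without digits would turn the product into 0.
if typeC_intersect_SouthAmerica == round(typeC * SouthAmerica, 3):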
  print('They are independent')
else:
  print('They are dependent')

print(f'TypeC intersect SouthAmerica: {typeC_intersect_SouthAmerica}, typeC: {typeC} * SouthAmerica: {SouthAmerica} = {round(SouthAmerica*typeC, 2)}')

They are dependent
TypeC intersect SouthAmerica: 0.072, typeC: 0.461 * SouthAmerica: 0.118 = 0.05


In [107]:
def color_map(val):
  if val == joint_dist.loc['TOTAL', 'South America'] or val == joint_dist.loc['Type C', 'TOTAL']:
    color = 'green'
    return f'color: {color}'
  elif val == joint_dist.loc['Type C', 'South America']:
    color = 'blue'
    return f'color: {color}'

In [108]:
joint_dist.style.applymap(color_map)

| | North America | South America | Asia | Australia | Europe | Africa | TOTAL |
|---|---|---|---|---|---|---|---|
| Type A | 0.054 | 0.043 | 0.194 | 0.005 | 0.065 | 0.007 | 0.367 |
| Type B | 0.036 | 0.004 | 0.018 | 0.043 | 0.047 | 0.025 | 0.172 |
| Type C | 0.032 | 0.072 | 0.269 | 0.05 | 0.004 | 0.034 | 0.461 |
| TOTAL | 0.122 | 0.118 | 0.48 | 0.099 | 0.115 | 0.066 | 1.0 |


Let's check dependence for another cell

In [36]:
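typeA_intersect_Africa = joint_dist.loc['Type A', 'Africa']
typeA = joint_dist.loc['Type A', 'TOTAL']
Africa = joint_dist.loc['TOTAL', 'Africa']

# Compare at the table's precision (3 decimals); round() without digits would turn the product into 0.
if typeA_intersect_Africa == round(typeA*Africa, 3):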
  print('They are independent')
else:
  print('They are dependent')

print(f'TypeA intersect Africa: {typeA_intersect_Africa}, typeA: {typeA} * Africa:{Africa} = {round(Africa*typeA, 2)}')

They are dependent
TypeA intersect Africa: 0.007, typeA: 0.367 * Africa:0.066 = 0.02
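
Instead of testing one cell at a time, the whole table can be compared against the joint distribution that independence would imply, which is the outer product of the two marginals. The sketch below is one way to do that under stated assumptions: `counts`, `expected`, and `diff` are illustrative names, and the tolerance `atol=1e-3` is an arbitrary choice, not a statistical test.

```python
import numpy as np
import pandas as pd

columns = ['North America', 'South America', 'Asia', 'Australia', 'Europe', 'Africa']
rows = ['Type A', 'Type B', 'Type C']
data = [
    [150, 120, 540, 15, 180, 20],
    [100, 10, 50, 120, 130, 70],
    [90, 200, 750, 140, 10, 95],
]
# Raw counts without TOTAL entries (illustrative name).
counts = pd.DataFrame(data=data, index=rows, columns=columns)

joint = counts / counts.to_numpy().sum()
p_disease = joint.sum(axis=1)
p_continent = joint.sum(axis=0)

# Under independence, P(disease and continent) = P(disease) * P(continent).
expected = pd.DataFrame(
    np.outer(p_disease, p_continent),
    index=joint.index,
    columns=joint.columns,
)

# Positive entries occur more often than independence predicts, negative less often.
diff = joint - expected

# Assumed tolerance: treat differences below 0.001 as equal.
independent_everywhere = np.allclose(joint.to_numpy(), expected.to_numpy(), atol=1e-3)

print(diff.round(3))
print('independent' if independent_everywhere else 'dependent')
```

For this dataset the check reports dependence, in line with the two single-cell examples above.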


In [112]:
def color_map(val):
  if val == joint_dist.loc['TOTAL', 'Africa'] or val == joint_dist.loc['Type A', 'TOTAL']:
    color = 'green'
    return f'color: {color}'
  elif val == joint_dist.loc['Type A', 'Africa']:
    color = 'blue'
    return f'color: {color}'

In [113]:
joint_dist.style.applymap(color_map)

| | North America | South America | Asia | Australia | Europe | Africa | TOTAL |
|---|---|---|---|---|---|---|---|
| Type A | 0.054 | 0.043 | 0.194 | 0.005 | 0.065 | 0.007 | 0.367 |
| Type B | 0.036 | 0.004 | 0.018 | 0.043 | 0.047 | 0.025 | 0.172 |
| Type C | 0.032 | 0.072 | 0.269 | 0.05 | 0.004 | 0.034 | 0.461 |
| TOTAL | 0.122 | 0.118 | 0.48 | 0.099 | 0.115 | 0.066 | 1.0 |


And that's it :))