<center>
    <h1 id='categorical-features-statistic-i' style='color:#7159c1'>🧮 Categorical Features Statistic I 🧮</h1>
    <i>Exploring Categorical Features</i>
</center>

```
- Likert Scale
- Mode
- Frequency and Cross Tables
```

In [1]:
import pandas as pd # pip install pandas

df = pd.read_csv('./datasets/students.csv')
df

Unnamed: 0,name,main_breed,position,favorite_color,age
0,goku,sayan,low,purple,45
1,vegeta,sayan,high,purple,73
2,broly,sayan,high,purple,23
3,gohan,sayan,low,black,64
4,granolah,alien,low,purple,56
5,majin buu,alien,high,black,37
6,freeza,alien,high,black,61
7,piccolo,alien,low,black,24
8,gamma 1,android,high,purple,50
9,android 17,android,low,purple,78


<p id='0-likert-scale' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>0 | Likert Scale</p>

`Likert Scale` is a great technique to collect people's feedback about something. Even though its answers can be encoded as numbers, that does not change the fact that they are `Categorical Variables`.

Usually, a Likert Scale contains questions about people's satisfaction with products or services. Two common sets of possible answers are `Strongly Disagree, Disagree, Neither Agree nor Disagree / Neutral, Agree and Strongly Agree` and `Horrible, Bad, Good and Great`.

<figure>
    <img src='./assets/0 - Likert Scale.png' alt='Likert Scale' />
    <figcaption>Figure 0 - Example of a Likert Scale with a brief explanation of the possible answers to the questions. © <a href='https://slidemodel.com/likert-scale-quick-guide/' target='_blank'>SlideModel.com</a></figcaption>
</figure>

**Good Practice** - prefer an even number of possible answers and avoid neutral/balanced options, because a neutral option makes it easy for people to avoid expressing what they really think. So, instead of using `Horrible, Bad, Neutral, Good and Great`, use `Horrible, Bad, Good and Great`, so that nobody can fall back on `Neutral`.
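
To make the point about numeric encoding concrete, here is a minimal sketch, assuming a handful of made-up answers on the four-option scale above (the `answers` values are hypothetical and do not come from the `students` dataset). It stores the answers as an ordered Pandas categorical, so the numeric codes exist but the variable is still treated as categorical.

```python
import pandas as pd

# Hypothetical answers, invented only for this illustration
answers = ['Good', 'Great', 'Bad', 'Good', 'Horrible', 'Good']

# Four-option scale with no neutral choice, ordered from worst to best
scale = ['Horrible', 'Bad', 'Good', 'Great']
likert = pd.Series(pd.Categorical(answers, categories=scale, ordered=True))

# Integer codes follow the order of `scale`: Horrible=0, Bad=1, Good=2, Great=3
codes = likert.cat.codes.tolist()

# The ordering allows min/max, while the most frequent answer is still a label
worst_answer = likert.min()
most_common = likert.mode().iloc[0]

print(codes)
print(worst_answer, most_common)
```

The codes are only labels for positions on the scale, so the distance between `Bad` and `Good` is not guaranteed to equal the distance between `Good` and `Great`.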

<p id='1-mode' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>1 | Mode</p>

Take a look at the `students` dataset at the beginning of this notebook (yeah, there is some Dragon Ball character data in it 😐). Suppose you find out that a new student will join the class tomorrow and nobody knows who it is.

If you had to guess the main breed of this student, what would you guess: sayan, alien or android? You can choose by intuition or you can use Statistics!

Using Statistics, you can guess the most frequent main breed. And since main breed is a Categorical Variable, its most frequent value is called the `mode`.

It is intuitive: you find the most frequent value and use it as your guess!! When two values tie for most frequent, both of them are modes, which is exactly what happens here with sayan and alien!

In [2]:
# ---- Mode ----
#
#   Spoiler: there are two modes, 'alien' and 'sayan', each appearing
# four times in the dataset. 'alien' is listed first only because Pandas
# returns the modes sorted in ascending order
#
df['main_breed'].mode()

0    alien
1    sayan
Name: main_breed, dtype: object

In [3]:
# ---- Mode ----
#
# - Another example using 'favorite_color' variable
#
df['favorite_color'].mode()

0    purple
Name: favorite_color, dtype: object
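
To check the result by hand, the sketch below counts each value and keeps every value that reaches the highest count. It rebuilds the `main_breed` column from the table at the top of the notebook so that it runs on its own; it illustrates the idea and is not a description of how Pandas implements `mode()` internally.

```python
import pandas as pd

# 'main_breed' column rebuilt from the students table shown above
main_breed = pd.Series(
    ['sayan'] * 4 + ['alien'] * 4 + ['android'] * 2
    , name='main_breed'
)

# how many times each value appears
counts = main_breed.value_counts()

# every value tied for the highest count is a mode
modes = sorted(counts[counts == counts.max()].index)

print(modes)
```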

<p id='2-frequence-and-cross-tables' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>2 | Frequency and Cross Tables</p>

`Frequency Tables` show how many times each possible value of a variable appears and, to build them, we can use `Cross Tables` (`pd.crosstab`) from Pandas. That is it! My way of studying is to write a brief and direct explanation about something and then show examples, because that is easier to understand. But yes, there are exceptions, for instance, when the content is too complex or dense.

---

**- One Way Frequency Table (One Dimensional)**

In [4]:
# ---- One Way Frequency Table ----
#
# - simple frequencies using cross table
#
frequency = pd.crosstab(
    index=df['main_breed']
    , columns='count'
)

frequency

col_0,count
main_breed,Unnamed: 1_level_1
alien,4
android,2
sayan,4


In [5]:
# ---- One Way Frequency Table ----
#
# - proportion (%) using cross table
#
frequency / frequency.sum()

col_0,count
main_breed,Unnamed: 1_level_1
alien,0.4
android,0.2
sayan,0.4


In [6]:
# ---- One Way Frequency Table ----
#
# - simple frequencies without using cross table
#
df['main_breed'].value_counts()

sayan      4
alien      4
android    2
Name: main_breed, dtype: int64

In [7]:
# ---- One Way Frequency Table ----
#
# - proportion (%) without using cross table
#
df['main_breed'].value_counts() / df['main_breed'].size

sayan      0.4
alien      0.4
android    0.2
Name: main_breed, dtype: float64
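
As an alternative to dividing by the column size, `value_counts` accepts a `normalize=True` argument that returns proportions directly. The sketch below rebuilds the `main_breed` column by hand so that it runs on its own.

```python
import pandas as pd

# 'main_breed' column rebuilt from the students table shown above
main_breed = pd.Series(
    ['sayan'] * 4 + ['alien'] * 4 + ['android'] * 2
    , name='main_breed'
)

# proportions instead of raw counts
proportions = main_breed.value_counts(normalize=True)

print(proportions)
```

One difference worth knowing: `value_counts` ignores missing values by default, while `.size` counts them. The two approaches agree here because the column has no missing values.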

---

**- Two Way Frequency Table (Two Dimensions)**

In [8]:
# ---- Two Way Frequency Table ----
#
# - simple frequencies without margin counts
#
frequency = pd.crosstab(
    index=df['main_breed']
    , columns=df['favorite_color']
)

frequency

favorite_color,black,purple
main_breed,Unnamed: 1_level_1,Unnamed: 2_level_1
alien,3,1
android,0,2
sayan,1,3


In [9]:
# ---- Two Way Frequency Table ----
#
# - simple frequencies with margin counts
#
frequency = pd.crosstab(
    index=df['main_breed']
    , columns=df['favorite_color']
    , margins=True
)

frequency

favorite_color,black,purple,All
main_breed,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
alien,3,1,4
android,0,2,2
sayan,1,3,4
All,4,6,10


In [10]:
# ---- Two Way Frequency Table ----
#
# - proportion of the grand total (every cell divided by the overall count)
#
frequency / frequency.loc['All', 'All']

favorite_color,black,purple,All
main_breed,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
alien,0.3,0.1,0.4
android,0.0,0.2,0.2
sayan,0.1,0.3,0.4
All,0.4,0.6,1.0


In [11]:
# ---- Two Way Frequency Table ----
#
# - proportion (%) by columns
#
frequency / frequency.loc['All']

favorite_color,black,purple,All
main_breed,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
alien,0.75,0.166667,0.4
android,0.0,0.333333,0.2
sayan,0.25,0.5,0.4
All,1.0,1.0,1.0


In [12]:
# ---- Two Way Frequency Table ----
#
# - proportion (%) by rows
#
frequency.div(
    frequency['All']
    , axis=0
)

favorite_color,black,purple,All
main_breed,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
alien,0.75,0.25,1.0
android,0.0,1.0,1.0
sayan,0.25,0.75,1.0
All,0.4,0.6,1.0
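
The three proportion tables above can also be produced directly with the `normalize` parameter of `pd.crosstab`, which accepts `'all'`, `'columns'` and `'index'`. The sketch below rebuilds the two columns it needs from the table at the top of the notebook and leaves the margins out to keep the output small.

```python
import pandas as pd

# columns rebuilt from the students table shown above
df = pd.DataFrame({
    'main_breed': ['sayan'] * 4 + ['alien'] * 4 + ['android'] * 2
    , 'favorite_color': [
        'purple', 'purple', 'purple', 'black'   # sayan
        , 'purple', 'black', 'black', 'black'   # alien
        , 'purple', 'purple'                    # android
    ]
})

# proportion of the grand total
overall = pd.crosstab(df['main_breed'], df['favorite_color'], normalize='all')

# proportion by columns (each column sums to 1)
by_column = pd.crosstab(df['main_breed'], df['favorite_color'], normalize='columns')

# proportion by rows (each row sums to 1)
by_row = pd.crosstab(df['main_breed'], df['favorite_color'], normalize='index')

print(by_row)
```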


---

**- Higher Way Frequency Table (Multi Dimensions)**

In [13]:
# ---- Higher Way Frequency Table ----
#
# - simple frequencies
#
frequency = pd.crosstab(
    index=df['main_breed']
    , columns=[df['favorite_color'], df['position']]
    , margins=True
)

frequency

favorite_color,black,black,purple,purple,All
position,high,low,high,low,Unnamed: 5_level_1
main_breed,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
alien,2,1,0,1,4
android,0,0,1,1,2
sayan,0,1,2,1,4
All,2,2,3,3,10


In [14]:
# ---- Higher Way Frequency Table ----
#
# - proportion (%) by columns
#
frequency / frequency.loc['All']

favorite_color,black,black,purple,purple,All
position,high,low,high,low,Unnamed: 5_level_1
main_breed,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
alien,1.0,0.5,0.0,0.333333,0.4
android,0.0,0.0,0.333333,0.333333,0.2
sayan,0.0,0.5,0.666667,0.333333,0.4
All,1.0,1.0,1.0,1.0,1.0
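
A higher way table has two levels of column labels, so reading values from it takes a little more care. The sketch below, which rebuilds the three columns it needs and skips the margins for simplicity, shows a few ways to select from it: a tuple for a single cell, the outer label for a sub-table, and `xs` for an inner label.

```python
import pandas as pd

# columns rebuilt from the students table shown above
df = pd.DataFrame({
    'main_breed': ['sayan'] * 4 + ['alien'] * 4 + ['android'] * 2
    , 'favorite_color': [
        'purple', 'purple', 'purple', 'black'
        , 'purple', 'black', 'black', 'black'
        , 'purple', 'purple'
    ]
    , 'position': [
        'low', 'high', 'high', 'low'
        , 'low', 'high', 'high', 'low'
        , 'high', 'low'
    ]
})

frequency = pd.crosstab(
    index=df['main_breed']
    , columns=[df['favorite_color'], df['position']]
)

# single cell: aliens whose favorite color is black and position is high
alien_black_high = frequency.loc['alien', ('black', 'high')]

# sub-table with only the 'black' columns (high and low)
black_only = frequency['black']

# sub-table with only the 'high' position, across both colors
high_only = frequency.xs('high', axis=1, level='position')

print(alien_black_high)
print(high_only)
```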


---

**- Personalized Frequency Table**

In [15]:
# ---- Personalized Frequency Table ----
def create_frequency_table(dataset, variable):
    """
    Description:
        - calculates the cross table of a variable from a dataset;
        - drops margin column;
        - adds 'percentage' column into the cross table;
        - keeps the rows in the order returned by the cross table, that is,
          sorted by the variable's values with the 'All' row last;
        - returns the result

    Parameters:
        - dataset: Pandas DataFrame;
        - variable: string.
    """
    frequencies = pd.crosstab(
        index=dataset[variable]
        , columns='count'
        , colnames=['']
        , margins=True
    )
    frequencies = frequencies.iloc[:, :-1].copy() # removing 'All' margin column (copy avoids assignment warnings)
    total_count = frequencies.loc['All', 'count'] # getting 'count' value at 'All' row
    frequencies['percentage'] = (frequencies['count'] / total_count).round(4) # proportion with four decimal places
    return frequencies

create_frequency_table(df, 'main_breed')

Unnamed: 0_level_0,count,percentage
main_breed,Unnamed: 1_level_1,Unnamed: 2_level_1
alien,4,0.4
android,2,0.2
sayan,4,0.4
All,10,1.0


<p id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</p>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/)