[Home](../../README.md)

### Feature Engineering

This Jupyter Notebook demonstrates various feature engineering processes you can apply to your poker dataset to improve the performance of your machine learning model. For this demonstration, we will engineer new or improved features for the poker hand dataset you previously wrangled.

#### Feature Engineering Process
- **Deriving New Variables from Existing Ones**:
    - Encoding categorical features (e.g., encoding suits numerically or one-hot encoding them).
    - Combining ranks and suits into a single feature (e.g., encoding a card as a unique value based on its rank and suit).
    - Calculating new features, such as the total rank sum, the number of cards with the same rank, or the number of cards with the same suit.

- **Combining Features/Feature Interactions**:
    - Creating features that represent poker hand strength (e.g., checking for pairs, flushes, straights, etc.).
    - Identifying patterns in the cards, such as the highest card rank or the number of consecutive ranks.

- **Transforming Features**:
    - Normalizing or scaling numerical features (e.g., scaling ranks and suits to a range of 0 to 1).
    - Applying mathematical transformations (e.g., logarithmic transformations) if necessary to reduce skewness or improve feature distribution.

- **Creating Domain-Specific Features**:
    - Incorporating poker-specific knowledge to create features that capture important characteristics of the data, such as:
        - Whether the hand contains a flush (all cards of the same suit).
        - Whether the hand contains a straight (consecutive ranks).
        - The number of pairs, three-of-a-kinds, or four-of-a-kinds in the hand.

#### Load the required dependencies

In [2]:
# Import frameworks
import pandas as pd

#### Store the data as a local variable

A pandas `DataFrame` holds tabular data in labeled rows and columns. `pd.read_csv` loads the whole file into memory, so the data is ready for feature engineering.

In [3]:
data_frame = pd.read_csv("poker_hand_dataset_wrangled_data.csv")

#### Deriving new variables from existing ones

##### Encoding Categorical Variables

Data encoding converts categorical data, such as suits of cards, into numerical format so that it can be used as input for machine learning algorithms. Most machine learning algorithms work with numbers and not with text or categorical variables.

In this dataset, the suits of cards (`Suit of Card 1`, `Suit of Card 2`, etc.) are already represented numerically (e.g., 1 for Clubs, 2 for Diamonds, 3 for Hearts, 4 for Spades). However, if needed, we can further encode these values into other formats, such as one-hot encoding, to make them more suitable for certain algorithms.

For example:
- **Label Encoding**: Suits are already encoded as integers (1 to 4).
- **One-Hot Encoding**: Each suit can be represented as a binary vector (e.g., `[1, 0, 0, 0]` for Clubs).
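
As a minimal sketch of the one-hot option, `pd.get_dummies` can expand a suit column into one indicator column per suit code. The toy frame and the `Suit1` prefix are illustrative assumptions and not part of the wrangled dataset.

```python
import pandas as pd

# Toy data: one suit column with integer codes 1-4 (illustrative values only)
toy = pd.DataFrame({"Suit of Card 1": [1, 3, 4, 2]})

# Expand the column into one 0/1 indicator column per suit code
one_hot = pd.get_dummies(toy["Suit of Card 1"], prefix="Suit1", dtype=int)

# Join the indicator columns back onto the original frame
toy_encoded = pd.concat([toy, one_hot], axis=1)
print(toy_encoded)
```

One-hot columns avoid implying an order between suits, at the cost of four columns per card instead of one.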


In [4]:
# Derive count features: number of distinct ranks in the hand (5 means no repeated rank)
data_frame['Unique Ranks'] = data_frame[['Rank of Card 1', 'Rank of Card 2', 'Rank of Card 3', 'Rank of Card 4', 'Rank of Card 5']].apply(lambda row: len(set(row)), axis=1)

# Count the number of unique suits in the hand
data_frame['Unique Suits'] = data_frame[['Suit of Card 1', 'Suit of Card 2', 'Suit of Card 3', 'Suit of Card 4', 'Suit of Card 5']].apply(lambda row: len(set(row)), axis=1)
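
The other derived features listed in the overview (total rank sum, highest rank, number of cards sharing a suit) can be computed in the same style. The sketch below uses a two-row toy frame; the column names `Rank Sum`, `Highest Rank`, and `Max Suit Frequency` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

rank_cols = [f"Rank of Card {i}" for i in range(1, 6)]
suit_cols = [f"Suit of Card {i}" for i in range(1, 6)]

# Two toy hands: five ranks followed by five suits (illustrative values only)
toy = pd.DataFrame(
    [[12, 11, 2, 3, 10, 1, 1, 1, 1, 1],
     [5, 5, 8, 6, 2, 1, 2, 2, 4, 3]],
    columns=rank_cols + suit_cols,
)

# Sum of the five ranks
toy["Rank Sum"] = toy[rank_cols].sum(axis=1)

# Highest rank in the hand
toy["Highest Rank"] = toy[rank_cols].max(axis=1)

# Size of the largest same-suit group (5 means all cards share a suit)
toy["Max Suit Frequency"] = toy[suit_cols].apply(
    lambda row: row.value_counts().max(), axis=1
)

print(toy[["Rank Sum", "Highest Rank", "Max Suit Frequency"]])
```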

##### Determining Poker Hand Type

In poker, the type of hand (e.g., "Flush", "Straight", "Full House") determines its strength. This feature is critical for understanding the dataset and making predictions. We will calculate the type of poker hand based on the ranks and suits of the cards in the hand. The hand type will be added as a new column called `Hand Type`.

In [5]:
# Reverse scaling for rank columns (assuming min-max normalization)
MIN_RANK = 1
MAX_RANK = 13
rank_columns = ['Rank of Card 1', 'Rank of Card 2', 'Rank of Card 3', 'Rank of Card 4', 'Rank of Card 5']

# Reverse the scaling
data_frame[rank_columns] = data_frame[rank_columns].apply(lambda x: (x * (MAX_RANK - MIN_RANK) + MIN_RANK).round().astype(int))

In [6]:
# Function to determine the type of poker hand
def determine_hand_type(row):
    ranks = sorted([row['Rank of Card 1'], row['Rank of Card 2'], row['Rank of Card 3'], row['Rank of Card 4'], row['Rank of Card 5']])
    suits = [row['Suit of Card 1'], row['Suit of Card 2'], row['Suit of Card 3'], row['Suit of Card 4'], row['Suit of Card 5']]
    
    # Check for Flush (all suits are the same)
    is_flush = len(set(suits)) == 1
    
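    # Check for Straight (five consecutive ranks).
    # Caveat: if rank 1 encodes the Ace, the Ace-high straight (10, J, Q, K, A) is not detected here.
    is_straight = ranks == list(range(ranks[0], ranks[0] + 5))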
    
    # Count occurrences of each rank
    rank_counts = {rank: ranks.count(rank) for rank in set(ranks)}
    rank_count_values = sorted(rank_counts.values(), reverse=True)
    
    # Determine hand type
    if is_flush and is_straight and ranks[-1] == 13:  # Royal Flush
        return 'Royal Flush'
    elif is_flush and is_straight:  # Straight Flush
        return 'Straight Flush'
    elif rank_count_values == [4, 1]:  # Four of a Kind
        return 'Four of a Kind'
    elif rank_count_values == [3, 2]:  # Full House
        return 'Full House'
    elif is_flush:  # Flush
        return 'Flush'
    elif is_straight:  # Straight
        return 'Straight'
    elif rank_count_values == [3, 1, 1]:  # Three of a Kind
        return 'Three of a Kind'
    elif rank_count_values == [2, 2, 1]:  # Two Pair
        return 'Two Pair'
    elif rank_count_values == [2, 1, 1, 1]:  # One Pair
        return 'One Pair'
    else:  # High Card
        return 'High Card'
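
The consecutive-rank check above treats the Ace only as a low card. Assuming rank 1 encodes the Ace and rank 13 the King (an assumption; the notebook does not state the encoding), an Ace-aware variant could look like the sketch below. The helper names `is_straight` and `is_royal` are hypothetical.

```python
def is_straight(ranks):
    """Return True if five ranks form a straight, with rank 1 (Ace) playing low or high."""
    ordered = sorted(ranks)
    # A straight never contains a repeated rank
    if len(set(ordered)) != 5:
        return False
    # Ordinary case: five consecutive ranks, Ace counted low (1, 2, 3, 4, 5)
    if ordered[-1] - ordered[0] == 4:
        return True
    # Ace-high case: 10, J, Q, K, A stored as 1, 10, 11, 12, 13
    return ordered == [1, 10, 11, 12, 13]


def is_royal(ranks, suits):
    """Royal Flush: the Ace-high straight in a single suit."""
    return len(set(suits)) == 1 and sorted(ranks) == [1, 10, 11, 12, 13]


print(is_straight([10, 11, 12, 13, 1]))            # True
print(is_royal([9, 10, 11, 12, 13], [2] * 5))      # False
```

Under this encoding, a King-high straight flush (9 through K) stays a Straight Flush, and only 10 through Ace counts as a Royal Flush.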

In [7]:
# Ensure rank columns are integers
rank_columns = ['Rank of Card 1', 'Rank of Card 2', 'Rank of Card 3', 'Rank of Card 4', 'Rank of Card 5']
data_frame[rank_columns] = data_frame[rank_columns].fillna(0).astype(int)

# Ensure suit columns are integers (if applicable)
suit_columns = ['Suit of Card 1', 'Suit of Card 2', 'Suit of Card 3', 'Suit of Card 4', 'Suit of Card 5']
data_frame[suit_columns] = data_frame[suit_columns].fillna(0).astype(int)
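
Note that `astype(int)` truncates toward zero instead of rounding. If the suit columns were min-max scaled to the range 0 to 1 during wrangling (an assumption; the notebook only reverses scaling for ranks), casting them directly would collapse three of the four suits to 0. The sketch below shows the difference on a toy column; `MIN_SUIT` and `MAX_SUIT` are assumed values.

```python
import pandas as pd

MIN_SUIT, MAX_SUIT = 1, 4  # assumed original suit codes

# Toy suit column as it would look after min-max scaling to [0, 1]
scaled_suits = pd.Series([0.0, 1 / 3, 2 / 3, 1.0])

# Casting directly truncates the fractional values toward zero
truncated = scaled_suits.astype(int)

# Reversing the scaling first, then rounding, recovers the suit codes
restored = (scaled_suits * (MAX_SUIT - MIN_SUIT) + MIN_SUIT).round().astype(int)

print(truncated.tolist())  # [0, 0, 0, 1]
print(restored.tolist())   # [1, 2, 3, 4]
```

A quick `data_frame[suit_columns].max()` before the cast shows whether the columns are still on the 0 to 1 scale.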

In [8]:
# Apply the function to determine the hand type
data_frame['Hand Type'] = data_frame.apply(determine_hand_type, axis=1)

# Verify the first few rows
print(data_frame[['Rank of Card 1', 'Rank of Card 2', 'Rank of Card 3', 'Rank of Card 4', 'Rank of Card 5', 'Hand Type']].head())

   Rank of Card 1  Rank of Card 2  Rank of Card 3  Rank of Card 4  \
0              12              11               2               3   
1               5               5               8               6   
2               2               3               3              11   
3               6               9               2              12   
4               3               9               9               3   

   Rank of Card 5   Hand Type  
0              10   High Card  
1               2    One Pair  
2               5    One Pair  
3              12    One Pair  
4               3  Full House  


In [9]:
# Count how often each hand type occurs
print(data_frame['Hand Type'].value_counts())

Hand Type
One Pair           12213
High Card           9700
Flush               7801
Two Pair            2054
Three of a Kind     1347
Full House           174
Straight             101
Four of a Kind        95
Straight Flush        33
Royal Flush            4
Name: count, dtype: int64
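
Counts alone do not confirm that the labels are right. One possible sanity check, assuming the hands were dealt uniformly at random from a standard 52-card deck (which may not hold for this dataset), is to compare the observed shares with the theoretical proportions sketched below.

```python
from math import comb

TOTAL_HANDS = comb(52, 5)  # number of distinct five-card hands

# Number of distinct hands per category in a standard 52-card deck
expected_counts = {
    "Royal Flush": 4,
    "Straight Flush": 10 * 4 - 4,
    "Four of a Kind": 13 * 48,
    "Full House": 13 * comb(4, 3) * 12 * comb(4, 2),
    "Flush": comb(13, 5) * 4 - 40,
    "Straight": 10 * 4**5 - 40,
    "Three of a Kind": 13 * comb(4, 3) * comb(12, 2) * 4**2,
    "Two Pair": comb(13, 2) * comb(4, 2) ** 2 * 11 * 4,
    "One Pair": 13 * comb(4, 2) * comb(12, 3) * 4**3,
}
# Everything that is left over is a High Card hand
expected_counts["High Card"] = TOTAL_HANDS - sum(expected_counts.values())

for hand, count in expected_counts.items():
    print(f"{hand:<16} {count / TOTAL_HANDS:.6f}")
```

Under that assumption flushes make up roughly 0.2% of all hands, so a much larger share in `value_counts()` would suggest re-checking the suit columns upstream.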


#### Adding Hand Strength Feature

To enhance the dataset, we are adding a new feature called `Hand Strength`. This feature assigns a numerical value to each poker hand type, where stronger hands (e.g., "Royal Flush") have higher values and weaker hands (e.g., "High Card") have lower values. This numerical representation can be useful for machine learning models or further analysis.

In [10]:
# Map hand types to numerical strength
hand_strength_mapping = {
    'High Card': 0,
    'One Pair': 1,
    'Two Pair': 2,
    'Three of a Kind': 3,
    'Straight': 4,
    'Flush': 5,
    'Full House': 6,
    'Four of a Kind': 7,
    'Straight Flush': 8,
    'Royal Flush': 9
}

# Add a new column for hand strength
data_frame['Hand Strength'] = data_frame['Hand Type'].map(hand_strength_mapping)

# Verify the new column
print(data_frame[['Hand Type', 'Hand Strength']].head())

    Hand Type  Hand Strength
0   High Card              0
1    One Pair              1
2    One Pair              1
3    One Pair              1
4  Full House              6
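
An alternative to a hand-written dictionary is an ordered categorical, which keeps the labels and derives the integer strength from their position. This is a sketch on toy data, not the notebook's method; `hand_order` is a hypothetical name.

```python
import pandas as pd

# Hand types from weakest to strongest
hand_order = [
    "High Card", "One Pair", "Two Pair", "Three of a Kind", "Straight",
    "Flush", "Full House", "Four of a Kind", "Straight Flush", "Royal Flush",
]

toy = pd.DataFrame({"Hand Type": ["High Card", "One Pair", "Full House", "Royal Flush"]})

# An ordered categorical exposes the position of each label as an integer code
hand_cat = pd.Categorical(toy["Hand Type"], categories=hand_order, ordered=True)
toy["Hand Strength"] = hand_cat.codes

# Labels missing from hand_order get code -1, which makes typos easy to spot
assert (toy["Hand Strength"] >= 0).all()
print(toy)
```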


#### Adding Actions Based on Hand Strength

We add five columns, `Fold`, `Check`, `Call`, `Raise`, and `All-In`, that give the recommended probability of each action for a hand based on its `Hand Strength`. The probabilities follow these rules:

- **Fold**: The most likely action for very weak hands (`Hand Strength` = 0, "High Card").
- **Check**: The most likely action for weak hands when no bet has been made (`Hand Strength` = 1, "One Pair").
- **Call**: The most likely action for medium-strength hands (`Hand Strength` between 2 and 5).
- **Raise**: The most likely action for strong hands (`Hand Strength` between 6 and 8).
- **All-In**: The most likely action for the strongest hand (`Hand Strength` = 9, "Royal Flush").

A `Best Move` column, taken as the highest-probability action in each row, can then serve as the target variable for training a machine learning model to predict the best move.

In [11]:
# Define action probabilities (fractions that sum to 1) based on hand strength
def action_percentages(row):
    if row['Hand Strength'] == 9:  # Royal Flush
        return {'Fold': 0.0, 'Check': 0.0, 'Call': 0.0, 'Raise': 0.2, 'All-In': 0.8}
    elif row['Hand Strength'] >= 6:  # Strong hands (Full House or better)
        return {'Fold': 0.0, 'Check': 0.0, 'Call': 0.1, 'Raise': 0.7, 'All-In': 0.2}
    elif row['Hand Strength'] >= 2:  # Medium-strength hands
        return {'Fold': 0.1, 'Check': 0.1, 'Call': 0.6, 'Raise': 0.2, 'All-In': 0.0}
    elif row['Hand Strength'] == 1:  # Weak hands but not folding
        return {'Fold': 0.3, 'Check': 0.5, 'Call': 0.2, 'Raise': 0.0, 'All-In': 0.0}
    else:  # Very weak hands
        return {'Fold': 0.8, 'Check': 0.2, 'Call': 0.0, 'Raise': 0.0, 'All-In': 0.0}

# Apply the function to calculate percentages for each action
action_columns = ['Fold', 'Check', 'Call', 'Raise', 'All-In']
data_frame[action_columns] = data_frame.apply(lambda row: pd.Series(action_percentages(row)), axis=1)

# Verify the new columns
print(data_frame[['Hand Type', 'Hand Strength'] + action_columns].head())

    Hand Type  Hand Strength  Fold  Check  Call  Raise  All-In
0   High Card              0   0.8    0.2   0.0    0.0     0.0
1    One Pair              1   0.3    0.5   0.2    0.0     0.0
2    One Pair              1   0.3    0.5   0.2    0.0     0.0
3    One Pair              1   0.3    0.5   0.2    0.0     0.0
4  Full House              6   0.0    0.0   0.1    0.7     0.2
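
A single `Best Move` label can be derived from the five probability columns by taking the most probable action per row. The sketch below does this on three toy rows copied from the table above; using `idxmax` is one possible choice, not necessarily the one used later in the project.

```python
import pandas as pd

action_columns = ["Fold", "Check", "Call", "Raise", "All-In"]

# Toy rows copied from the probability table above
toy = pd.DataFrame(
    [[0.8, 0.2, 0.0, 0.0, 0.0],
     [0.3, 0.5, 0.2, 0.0, 0.0],
     [0.0, 0.0, 0.1, 0.7, 0.2]],
    columns=action_columns,
)

# Each row should be a valid probability distribution
assert (toy[action_columns].sum(axis=1).round(6) == 1.0).all()

# Pick the action with the highest probability as the label
toy["Best Move"] = toy[action_columns].idxmax(axis=1)
print(toy["Best Move"].tolist())  # ['Fold', 'Check', 'Raise']
```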


#### Adding the `Max Rank Frequency` Feature

The `Max Rank Frequency` feature identifies how many times the most frequent rank appears in the hand. This feature is useful for distinguishing between hands with repeated ranks (e.g., "Four of a Kind," "Three of a Kind," "One Pair") and hands with unique ranks (e.g., "Straight," "High Card").

- **Why it helps**: 
  - Helps the model understand the structure of the hand.
  - Provides critical information for evaluating hands with repeated ranks.

- **How it works**:
  - For each hand, count how many times each rank appears.
  - Take the maximum count as the `Max Rank Frequency`.

- **Examples**:
  - Hand: `[2, 2, 2, 5, 7]` → `Max Rank Frequency = 3` (Three of a Kind).
  - Hand: `[3, 3, 5, 5, 9]` → `Max Rank Frequency = 2` (Two Pair; a One Pair hand also gives 2).
  - Hand: `[4, 6, 8, 10, 12]` → `Max Rank Frequency = 1` (no repeated rank, as in High Card, Straight, or Flush hands).

In [14]:
# Calculate the maximum frequency of any rank in the hand
data_frame['Max Rank Frequency'] = data_frame[['Rank of Card 1', 'Rank of Card 2', 'Rank of Card 3', 'Rank of Card 4', 'Rank of Card 5']].apply(
    lambda row: row.value_counts().max(), axis=1
)

# Verify the new column
print(data_frame[['Rank of Card 1', 'Rank of Card 2', 'Rank of Card 3', 'Rank of Card 4', 'Rank of Card 5', 'Max Rank Frequency']].head())

   Rank of Card 1  Rank of Card 2  Rank of Card 3  Rank of Card 4  \
0              12              11               2               3   
1               5               5               8               6   
2               2               3               3              11   
3               6               9               2              12   
4               3               9               9               3   

   Rank of Card 5  Max Rank Frequency  
0              10                   1  
1               2                   2  
2               5                   2  
3              12                   2  
4               3                   3  
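
On its own, `Max Rank Frequency` cannot separate Two Pair from One Pair, or Full House from Three of a Kind. Combined with `Unique Ranks`, it identifies every rank-based hand group. The sketch below illustrates this with plain Python; `rank_pattern` and `PATTERN_TO_GROUP` are hypothetical names.

```python
from collections import Counter


def rank_pattern(ranks):
    """Return (number of unique ranks, max rank frequency) for a five-card hand."""
    counts = Counter(ranks)
    return len(counts), max(counts.values())


# Every possible pattern for five cards and the hand group it implies
PATTERN_TO_GROUP = {
    (2, 4): "Four of a Kind",
    (2, 3): "Full House",
    (3, 3): "Three of a Kind",
    (3, 2): "Two Pair",
    (4, 2): "One Pair",
    (5, 1): "No repeated rank",
}

# Hands taken from the first rows of the output above
for hand in ([3, 9, 9, 3, 3], [2, 3, 3, 11, 5], [12, 11, 2, 3, 10]):
    print(hand, PATTERN_TO_GROUP[rank_pattern(hand)])
```

Hands with no repeated rank still need the suit and straight checks to tell High Card, Straight, and Flush apart.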


#### Save the wrangled and engineered data to CSV

In [18]:
data_frame.to_csv('model_ready_data.csv', index=False)