[Home](../../README.md)

### Feature Engineering

This Jupyter Notebook demonstrates various feature engineering processes you can apply to your poker dataset to improve the performance of your machine learning model. For this demonstration, we will engineer new or improved features for the poker hand dataset you previously wrangled.

#### Feature Engineering Process
- **Deriving New Variables from Existing Ones**:
    - Encoding categorical features (e.g., encoding suits numerically or one-hot encoding them).
    - Combining ranks and suits into a single feature (e.g., encoding a card as a unique value based on its rank and suit).
    - Calculating new features, such as the total rank sum, the number of cards with the same rank, or the number of cards with the same suit.

- **Combining Features/Feature Interactions**:
    - Creating features that represent poker hand strength (e.g., checking for pairs, flushes, straights, etc.).
    - Identifying patterns in the cards, such as the highest card rank or the number of consecutive ranks.

- **Identifying the Most Relevant Features for the Model**:
    - Using feature selection techniques to identify the most important features for predicting the poker hand.

- **Transforming Features**:
    - Normalizing or scaling numerical features (e.g., scaling ranks and suits to a range of 0 to 1).
    - Applying mathematical transformations (e.g., logarithmic transformations) if necessary to reduce skewness or improve feature distribution.

- **Creating Domain-Specific Features**:
    - Incorporating poker-specific knowledge to create features that capture important characteristics of the data, such as:
        - Whether the hand contains a flush (all cards of the same suit).
        - Whether the hand contains a straight (consecutive ranks).
        - The number of pairs, three-of-a-kinds, or four-of-a-kinds in the hand.

#### Load the required dependencies

In [5]:
# Import frameworks
import pandas as pd

####  Store the data as a local variable

The data frame is a pandas object that holds your tabular data in labeled rows and columns. `pd.read_csv` loads the complete file into memory, so the data is ready for preprocessing.

In [6]:
data_frame = pd.read_csv("poker_hand_dataset_wrangled_data.csv")

####  Deriving new variables from existing ones

##### Encoding Categorical Variables

Data encoding converts categorical data, such as suits of cards, into numerical format so that it can be used as input for machine learning algorithms. Most machine learning algorithms work with numbers and not with text or categorical variables.

In this dataset, the suits of cards (`Suit of Card 1`, `Suit of Card 2`, etc.) are already represented numerically (e.g., 1 for Clubs, 2 for Diamonds, 3 for Hearts, 4 for Spades). However, if needed, we can further encode these values into other formats, such as one-hot encoding, to make them more suitable for certain algorithms.

For example:
- **Label Encoding**: Suits are already encoded as integers (1 to 4).
- **One-Hot Encoding**: Each suit can be represented as a binary vector (e.g., `[1, 0, 0, 0]` for Clubs).
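
One-hot encoding of the suit columns could look like the sketch below. The toy values are made up, and fixing the categories to 1 through 4 with `pd.CategoricalDtype` is a choice made here (not something the notebook does) so that every suit gets an indicator column even when it is missing from a sample.

```python
import pandas as pd

# Made-up suit values for two cards; the column names follow this notebook.
toy = pd.DataFrame({"Suit of Card 1": [1, 4, 2], "Suit of Card 2": [3, 3, 1]})
suit_cols = ["Suit of Card 1", "Suit of Card 2"]

# Declare all four suits as categories so the output always has
# four indicator columns per card, whatever suits appear in the data.
suit_dtype = pd.CategoricalDtype(categories=[1, 2, 3, 4])
as_category = toy[suit_cols].astype(suit_dtype)

# One indicator column per (card, suit) pair, e.g. "Suit of Card 1_3".
one_hot = pd.get_dummies(as_category, dtype=int)

print(one_hot)
```

Tree-based models usually cope with the integer codes as they are, while linear models and neural networks tend to benefit from the one-hot form because the suit numbers carry no natural order.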


In [7]:
# Count the number of unique ranks in the hand
data_frame['Unique Ranks'] = data_frame[['Rank of Card 1', 'Rank of Card 2', 'Rank of Card 3', 'Rank of Card 4', 'Rank of Card 5']].apply(lambda row: len(set(row)), axis=1)

# Count the number of unique suits in the hand
data_frame['Unique Suits'] = data_frame[['Suit of Card 1', 'Suit of Card 2', 'Suit of Card 3', 'Suit of Card 4', 'Suit of Card 5']].apply(lambda row: len(set(row)), axis=1)

##### Determining Poker Hand Type

In poker, the type of hand (e.g., "Flush", "Straight", "Full House") determines its strength. This feature is critical for understanding the dataset and making predictions. We will calculate the type of poker hand based on the ranks and suits of the cards in the hand. The hand type will be added as a new column called `Hand Type`.

In [8]:
# Function to determine the type of poker hand
def determine_hand_type(row):
    # Cast to int: pandas may store these columns as floats, which range() rejects
    ranks = sorted(int(row[f'Rank of Card {i}']) for i in range(1, 6))
    suits = [int(row[f'Suit of Card {i}']) for i in range(1, 6)]

    # Check for Flush (all suits are the same)
    is_flush = len(set(suits)) == 1

    # Ace-high straight (10-J-Q-K-A); ASSUMPTION: Ace is encoded as rank 1
    is_ace_high_straight = ranks == [1, 10, 11, 12, 13]

    # Check for Straight (five consecutive ranks, or the ace-high case)
    is_straight = ranks == list(range(ranks[0], ranks[0] + 5)) or is_ace_high_straight

    # Count occurrences of each rank
    rank_counts = {rank: ranks.count(rank) for rank in set(ranks)}
    rank_count_values = sorted(rank_counts.values(), reverse=True)

    # Determine hand type, checking the strongest hands first
    if is_flush and is_ace_high_straight:  # Royal Flush
        return 'Royal Flush'
    elif is_flush and is_straight:  # Straight Flush
        return 'Straight Flush'
    elif rank_count_values == [4, 1]:  # Four of a Kind
        return 'Four of a Kind'
    elif rank_count_values == [3, 2]:  # Full House
        return 'Full House'
    elif is_flush:  # Flush
        return 'Flush'
    elif is_straight:  # Straight
        return 'Straight'
    elif rank_count_values == [3, 1, 1]:  # Three of a Kind
        return 'Three of a Kind'
    elif rank_count_values == [2, 2, 1]:  # Two Pair
        return 'Two Pair'
    elif rank_count_values == [2, 1, 1, 1]:  # One Pair
        return 'One Pair'
    else:  # High Card
        return 'High Card'

# Apply the function to determine the hand type
data_frame['Hand Type'] = data_frame.apply(determine_hand_type, axis=1)

# Display the first few rows to verify the new feature
print(data_frame[['Rank of Card 1', 'Rank of Card 2', 'Rank of Card 3', 'Rank of Card 4', 'Rank of Card 5',
                  'Suit of Card 1', 'Suit of Card 2', 'Suit of Card 3', 'Suit of Card 4', 'Suit of Card 5',
                  'Hand Type']].head())

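Because `Hand Type` is stored as text, most models will need it converted to a number before training. One possible approach, sketched below on made-up values, is an ordered categorical whose codes follow the standard poker ranking from weakest to strongest. The `HAND_ORDER` list and the `Hand Strength` column name are assumptions introduced for this illustration.

```python
import pandas as pd

# Hand types from weakest to strongest (standard poker ranking).
HAND_ORDER = [
    "High Card", "One Pair", "Two Pair", "Three of a Kind", "Straight",
    "Flush", "Full House", "Four of a Kind", "Straight Flush", "Royal Flush",
]

# Made-up labels standing in for the notebook's 'Hand Type' column.
toy = pd.DataFrame({"Hand Type": ["Flush", "One Pair", "High Card", "Full House"]})

# An ordered categorical keeps the ranking; .cat.codes gives 0 for the
# weakest hand type up to 9 for the strongest.
hand_dtype = pd.CategoricalDtype(categories=HAND_ORDER, ordered=True)
toy["Hand Strength"] = toy["Hand Type"].astype(hand_dtype).cat.codes

print(toy)
```

Unlike one-hot encoding, this keeps the information that a flush beats a pair, which suits a feature whose categories have a natural order.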

#### Combining features/feature interactions

While individual features can be powerful predictors, their interactions often carry even more information. Feature interaction engineering is the process of creating new features that represent the interaction between two or more features.

In this case, domain knowledge and data analysis have informed you that BMI and age are risk multipliers (the greater the age and the greater the BMI, the greater the risk). From this, we can calculate a risk value based on the interaction between the two features.

In [5]:
# Calculate the year difference and round to an integer
data_frame['Age'] = ((data_frame['DoT'] - data_frame['DoB']).dt.days / 365.25).round().astype(int)

# Create the 'Risk' column
data_frame['Risk'] = data_frame['BMI'] * data_frame['Age']

# Calculate the percentage of the maximum risk
data_frame['Risk%'] = (data_frame['Risk'] / data_frame['Risk'].max()).round(2)

# Print the result
print(data_frame[['Age', 'BMI', 'Risk%']])

     Age       BMI  Risk%
0     44  0.346667   0.30
1     49  0.353333   0.34
2     41  0.183333   0.15
3     39  0.206667   0.16
4     60  0.353333   0.42
..   ...       ...    ...
434   65  0.616667   0.79
435   58  0.723333   0.83
436   58  0.766667   0.88
437   35  0.876667   0.61
438    0  0.416667   0.00

[439 rows x 3 columns]


#### Transforming Features

Filtering is like applying a `WHERE` clause in a database. It is widely used and helps when you need to work on a specific subset of your data. For our use case, let us filter the data to include only rows where `SEX` is 'Male', which the code below selects with the value `-1`. No dedicated method call is needed; conditional (boolean) indexing is enough.

In this case, domain knowledge and data analysis have informed you that the data is bimodal: males and females follow different trends.

In [6]:
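# Keep only the rows where SEX == -1; .copy() avoids a SettingWithCopyWarning
# when new columns are added to the filtered frame later
data_frame = data_frame[data_frame['SEX'] == -1].copy()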

# Print the result
print(data_frame[['Age', 'SEX', 'Target']])

     Age  SEX  Target
0     44   -1    25.0
1     49   -1    31.0
2     41   -1    37.0
4     60   -1    39.0
5     48   -1    40.0
..   ...  ...     ...
425   67   -1   303.0
426   43   -1   306.0
430   29   -1   310.0
432   41   -1   317.0
437   35   -1   346.0

[234 rows x 3 columns]


#### Creating Domain-Specific Features

Domain knowledge is about understanding the domain or subject area of the data. In this case, the domain is health and, more specifically, epidemiology: the study of how often diseases occur in different groups of people and why.

The '1st Degree Relatives' column (`FDR` in the code below) is a domain-specific feature, as it records the number of family members in the individual's direct bloodline who have developed adult-onset type 2 diabetes. Domain knowledge tells us that a family history of disease in first-degree relatives is a major risk factor, especially for premature events.

First, we will convert the FDR value to a risk percentage. Because the risk can never be 0 (will never happen) or 100% (will definitely happen), we will scale the result between 0.15 and 0.85.

In [7]:
# Calculate the family history risk
data_frame['FHRisk'] = (data_frame['FDR'] / data_frame['FDR'].max())

# Scale the result between 0.15 and 0.85
min_val = 0.15
max_val = 0.85
data_frame['FHRisk'] = (((data_frame['FHRisk'] - data_frame['FHRisk'].min()) / (data_frame['FHRisk'].max() - data_frame['FHRisk'].min())) * (max_val - min_val) + min_val).round(2)

# Print the result
print(data_frame[['Age', 'FDR', 'FHRisk']])

     Age  FDR  FHRisk
0     44    0    0.15
1     49    1    0.38
2     41    1    0.38
4     60    0    0.15
5     48    0    0.15
..   ...  ...     ...
425   67    1    0.38
426   43    1    0.38
430   29    2    0.62
432   41    1    0.38
437   35    3    0.85

[234 rows x 3 columns]


Then, to make it even more meaningful, we will combine it with the `Risk%` feature we engineered from the `Age` and `BMI` features to create a combined risk 'interaction feature' that captures real-world relationships between the features.

Again, we will scale the result between 0.15 and 0.85.

In [None]:
# Combine the family history risk with the age/BMI risk
data_frame['CombRisk'] = (data_frame['FHRisk'] * data_frame['Risk%']).round(2)

# Scale the result between 0.15 and 0.85
min_val = 0.15
max_val = 0.85
data_frame['CombRisk'] = (((data_frame['CombRisk'] - data_frame['CombRisk'].min()) / (data_frame['CombRisk'].max() - data_frame['CombRisk'].min())) * (max_val - min_val) + min_val).round(2)

# Print the result
print(data_frame[['Age', 'Risk%', 'FHRisk', 'CombRisk']])
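
The same min-max scaling is applied to both `FHRisk` and `CombRisk`, so it could be factored into a small helper. The sketch below is one way to do that; the `scale_to_range` name, the midpoint fallback for constant columns, and the toy `FDR` values are assumptions made here and are not part of the notebook.

```python
import pandas as pd


def scale_to_range(values: pd.Series, low: float = 0.15, high: float = 0.85) -> pd.Series:
    """Min-max scale a Series into [low, high], rounded to 2 decimals."""
    span = values.max() - values.min()
    if span == 0:
        # Every value is identical, so min-max scaling is undefined;
        # fall back to the midpoint of the target range.
        return pd.Series((low + high) / 2, index=values.index).round(2)
    return ((values - values.min()) / span * (high - low) + low).round(2)


# Made-up counts of first-degree relatives.
fdr = pd.Series([0, 1, 1, 2, 3], name="FDR")
fh_risk = scale_to_range(fdr)

print(fh_risk.tolist())
```

Min-max scaling gives the same result whether or not the column is first divided by its maximum, so the helper can be applied to the raw `FDR` counts directly.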

#### Save the wrangled and engineered data to CSV

In [8]:
data_frame.to_csv('../2.3.Model_Training/2.3.1.model_ready_data.csv', index=False)