# Objective
The goal of Milestone 3 is to generate code to randomly remove some data to replicate incomplete data, and then determine the accuracy of the coding to determine the opening move with missing data. This code begins by randomly generating the data removed and then applying the pattern recognition framework to assess its accuracy in identifying chess openings from incomplete game records.

*Note: 'test_pattern_recognition_v2.ipynb' contains and utilises training and testing data to complete objective. This code loops around all losses and gives accuracy for all.*

# Step 1: Import Libraries and Setup

This section imports the necessary libraries for data manipulation and visualisation. 

The sys library is used to modify the system path to include the directory where the ChessOpeningMapper module is located.


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import zipfile
import os
import numpy as np
import random
from sklearn.model_selection import train_test_split



# Import ChessOpeningMapper
from ChessOpeningMapper import ChessOpeningMapper

In [2]:
import os
print("Current Working Directory:", os.getcwd())


Current Working Directory: c:\Users\clancyam\Documents\GitHub\UniSA_ICT_2024_SP4_P3_\Sprint 4


# Step 2: Load Opening Moves and Create Trie Structure.
# 
In this step, an instance of ChessOpeningMapper is created.

A list of file paths to the TSV files containing chess openings is defined.

These TSV files are merged into a single DataFrame using merge_tsv_files.

The PGN strings are split into individual moves using split_pgn_to_columns.

A Trie structure is created from the opening moves using create_trie.

In [3]:
# Create an instance of ChessOpeningMapper
mapper = ChessOpeningMapper()

# Define a list of file paths to the TSV files containing chess openings, I had issues with the path, so I have mapped them manually. 
file_list = [
    '../Chess Pattern Recognition/a.tsv',
    '../Chess Pattern Recognition/b.tsv',
    '../Chess Pattern Recognition/c.tsv',
    '../Chess Pattern Recognition/d.tsv',
    '../Chess Pattern Recognition/e.tsv'
]

# Merge the TSV files into a single DataFrame
opening_moves = mapper.merge_tsv_files(file_list)

# Split the PGN strings into individual moves
opening_moves = mapper.split_pgn_to_columns(opening_moves)

# Create a Trie structure from the opening moves
mapper.create_trie(opening_moves)

# Display the first few rows of the opening moves DataFrame
print("Opening Moves DataFrame:")
print(opening_moves.head())

Opening Moves DataFrame:
   eco                                     name  \
0  A00                             Amar Opening   
1  A00               Amar Opening: Paris Gambit   
2  A00  Amar Opening: Paris Gambit, Gent Gambit   
3  A00                         Amsterdam Attack   
4  A00                      Anderssen's Opening   

                                                 pgn Move_ply_1 Move_ply_2  \
0                                             1. Nh3        Nh3       None   
1                           1. Nh3 d5 2. g3 e5 3. f4        Nh3         d5   
2  1. Nh3 d5 2. g3 e5 3. f4 Bxh3 4. Bxh3 exf4 5. ...        Nh3         d5   
3             1. e3 e5 2. c4 d6 3. Nc3 Nc6 4. b3 Nf6         e3         e5   
4                                              1. a3         a3       None   

  Move_ply_3 Move_ply_4 Move_ply_5 Move_ply_6 Move_ply_7  ... Move_ply_28  \
0       None       None       None       None       None  ...        None   
1         g3         e5         f4       None

# Step 3: Unzip and Load Chess Game Data
This step involves:

Defining the path to the zipped game data file.

Unzipping the game data file to extract the CSV file.

Loading the extracted CSV file into a DataFrame.

In [4]:
# Define the path to the zipped game data file
game_data_zip_path = '../Chess Pattern Recognition/chessdata.zip'

# Define the name of the extracted CSV file
extracted_file_name = 'chessdata.csv'

# Unzip the game data file
ChessOpeningMapper.unzip_game_data(zip_path=game_data_zip_path, extract_to='.')

# Load the extracted CSV file into a DataFrame
game_data = ChessOpeningMapper.load_game_data(zip_path=game_data_zip_path, extracted_file_name=extracted_file_name)

# Display the first few rows of the game data DataFrame
print("Game Data DataFrame:")
print(game_data.head())


  game_data = pd.read_csv(extracted_file_name)


Game Data DataFrame:
   Index        Date  ECO                                 Opening Result  \
0      0  2019.04.30  B15                       Caro-Kann Defense    0-1   
1      1  2019.04.30  C50                            Italian Game    0-1   
2      2  2019.04.30  C41                     Philidor Defense #2    1-0   
3      3  2019.04.30  B06                          Modern Defense    0-1   
4      4  2019.04.30  B32  Sicilian Defense: Loewenthal Variation    1-0   

  Termination TimeControl     UTCDate   UTCTime Move_ply_1  ... Clock_ply_192  \
0      Normal       300+3  2019.04.30  22:00:24         d4  ...           NaN   
1      Normal       300+0  2019.04.30  22:00:13         e4  ...           NaN   
2      Normal       600+0  2019.04.30  22:00:41         e4  ...           NaN   
3      Normal        60+0  2019.04.30  22:00:43         e4  ...           NaN   
4      Normal       180+0  2019.04.30  22:00:46         e4  ...           NaN   

  Clock_ply_193 Clock_ply_194 Clock

# Step 4: Map Opening Names to Game Data

Here:

The game data is processed to map the move sequences to opening names using get_opening_name_from_game.

The mapped opening names are added to the original game data DataFrame in a new column called mapped_opening.


In [5]:
# Map the opening names to the game data
result_df = mapper.get_opening_name_from_game(game_data)

# Add the mapped opening names to the original game data DataFrame
game_data['mapped_opening'] = result_df['opening_name']

# Display the first 5 rows of the updated game data DataFrame
print(f"First 5 rows with mapped openings: \n{game_data.head()}")

First 5 rows with mapped openings: 
   Index        Date  ECO                                 Opening Result  \
0      0  2019.04.30  B15                       Caro-Kann Defense    0-1   
1      1  2019.04.30  C50                            Italian Game    0-1   
2      2  2019.04.30  C41                     Philidor Defense #2    1-0   
3      3  2019.04.30  B06                          Modern Defense    0-1   
4      4  2019.04.30  B32  Sicilian Defense: Loewenthal Variation    1-0   

  Termination TimeControl     UTCDate   UTCTime Move_ply_1  ... Clock_ply_193  \
0      Normal       300+3  2019.04.30  22:00:24         d4  ...           NaN   
1      Normal       300+0  2019.04.30  22:00:13         e4  ...           NaN   
2      Normal       600+0  2019.04.30  22:00:41         e4  ...           NaN   
3      Normal        60+0  2019.04.30  22:00:43         e4  ...           NaN   
4      Normal       180+0  2019.04.30  22:00:46         e4  ...           NaN   

  Clock_ply_194 Cloc

# Step 5: Train/Test Split

Here: 

Split the dataset into training and test sets with a 70/30 ratio. Display the sizes of the train and test datasets to verify the split.

In [6]:
# Step 4: Train/Test Split

# Filter out openings that appear less than twice
value_counts = game_data['mapped_opening'].value_counts()
to_keep = value_counts[value_counts > 1].index
game_data = game_data[game_data['mapped_opening'].isin(to_keep)]

# Split the dataset into training and test sets (70/30) and stratify for equal representation of openings in each split
train_df, test_df = train_test_split(game_data, test_size=0.3, random_state=42,stratify=game_data['mapped_opening'])

# Display the sizes of the train and test datasets
print(f"Training data size: {len(train_df)}")
print(f"Testing data size: {len(test_df)}")

Training data size: 139777
Testing data size: 59905


In [7]:
print("Original Distribution:\n", game_data['mapped_opening'].value_counts(normalize=True))
print("Training Distribution:\n", train_df['mapped_opening'].value_counts(normalize=True))
print("Testing Distribution:\n", test_df['mapped_opening'].value_counts(normalize=True))

Original Distribution:
 mapped_opening
Queen's Pawn Game                                                 0.025491
Horwitz Defense                                                   0.021334
Philidor Defense                                                  0.019336
Queen's Pawn Game: Accelerated London System                      0.018444
Van't Kruijs Opening                                              0.016712
                                                                    ...   
Queen's Gambit Accepted: Alekhine Defense, Haberditz Variation    0.000010
Borg Defense: Zilbermints Gambit                                  0.000010
King's Gambit Accepted: Bishop's Gambit, Lopez Variation          0.000010
Semi-Slav Defense: Marshall Gambit, Main Line                     0.000010
English Opening: King's English Variation, Taimanov Variation     0.000010
Name: proportion, Length: 1642, dtype: float64
Training Distribution:
 mapped_opening
Queen's Pawn Game                                 

# Step 6: Generate Random Code

Here:

A method is defined to generate a random percentage of data poiints across the move plies to be removed

it returns the modified dataset with the removed values and a dictionary of the lost datapoints

It randomly removes data points by taking the raw data, the percentage wish to be removed and the total plies across the percentage of moves to be removed

It then randomly picks a row in the dataset and a corresponding column that is the move plies within the set limit and forms the indice to be removed.

The current test only works on 1% of the data across 50 plies to minimise compute time while testing.
it the prints out the total number of row indices removed from each column.


In [8]:
# Generate Random Code and Apply to Both Datasets
def generate_random_data_removal(data, percent_remove=5, total_plies=100):
    """
    Randomly removes a specified percentage of data points from the first 'total_plies' columns of a DataFrame,
    and returns a dictionary of the indices removed.

    Args:
    - data (pd.DataFrame): The DataFrame from which to remove data.
    - percent_remove (int): The percentage of data points to remove.
    - total_plies (int): The total number of plies to consider for data removal.

    Returns:
    - pd.DataFrame: A DataFrame with randomly removed data points.
    - dict: A dictionary containing the indices of the data points that were removed.
    """
    modified_data = data.copy()
    loss_dict = {}

    # Calculate the total number of data points to remove
    num_data_points = int((total_plies * len(data)) * (percent_remove / 100))
    
    # Generate random row and column indices to set to None
    row_indices = np.random.randint(0, len(data), num_data_points)
    col_indices = np.random.randint(0, total_plies, num_data_points)  
    
    # Set the selected data points to None and record the indices in loss_dict
    for row, col in zip(row_indices, col_indices):
        modified_data.iloc[row, col] = None
        if col in loss_dict:
            loss_dict[col].append(row)
        else:
            loss_dict[col] = [row]



    return modified_data, loss_dict



# Apply random move removal to training and testing datasets for all losses
train_datasets, train_loss_dict = generate_random_data_removal(train_df,percent_remove=1,total_plies = 100)
test_datasets, test_loss_dict = generate_random_data_removal(test_df,percent_remove=1, total_plies=100)

print("Sample of training data after random move removal:")
print(train_df.head())

print("Sample of testing data after random move removal:")
print(test_df.head())

# test example of 1% of data removed across 50 plies of the training data
for column, rows in train_loss_dict.items():
    print(f"Column {column}: {len(rows)} rows removed")

Sample of training data after random move removal:
         Index        Date  ECO  \
160507  160507  2019.05.17  C45   
156024  156024  2019.05.17  C45   
42935    42935  2019.05.03  B12   
29007    29007  2019.05.27  B10   
172373  172373  2019.05.18  C55   

                                                  Opening Result  \
160507                                        Scotch Game    0-1   
156024                   Scotch Game: Classical Variation    0-1   
42935   Caro-Kann Defense: Advance Variation, Short Va...    1-0   
29007                                   Caro-Kann Defense    1-0   
172373  Italian Game: Two Knights Defense, Max Lange A...    1-0   

         Termination TimeControl     UTCDate   UTCTime Move_ply_1  ...  \
160507        Normal      900+15  2019.05.17  19:16:57         e4  ...   
156024        Normal       180+0  2019.05.17   2:48:19         e4  ...   
42935   Time forfeit       180+0  2019.05.03  23:42:29         e4  ...   
29007         Normal      900+15 

**Step 6.1 Testing**
This tests that the data that has been removed is exclusivly from the plies columns

In [13]:
def print_removed_values(data, loss_dict, max_samples=20):
    total_displayed = 0  # Counter to track how many values have been displayed

    for column, rows in loss_dict.items():
        column_name = f'Move_ply_{column+1}'  
        if len(rows) > 0:
            original_values = data.loc[rows, column_name]  
            sample_size = min(len(rows), max_samples - total_displayed)  
            print(f"Original values removed from Column {column_name}:")
            print(original_values.sample(sample_size))  

            total_displayed += sample_size  # Update the counter
            if total_displayed >= max_samples:
                break  # Stop if the maximum number of samples has been reached

#Resets the index of the game data dataframe
game_data.reset_index(drop=True, inplace=True)
# Example usage
print_removed_values(game_data, train_loss_dict)

Original values removed from Column Move_ply_51:
106269     Qh4
109039     Nf5
78860      NaN
9905       NaN
52782     Nxe5
99548      NaN
124209     NaN
73723      NaN
110753    Rxf1
43725      NaN
47855      Qc3
4926      dxe5
93795      Qh6
53368       g4
113230     Nh4
95548     Rhg1
57368      NaN
76746     Bxh3
109425     Kg2
102291    gxf4
Name: Move_ply_51, dtype: object


# Step 7: Prepare Data for Testing

In this step, game data is processed to create sequences from non-null moves, preparing it for pattern recognition testing. This step ensures each game's moves are consolidated into a format suitable for analysis.

In [14]:
# Prepare data by creating move sequences from non-null moves
def prepare_game_data_for_testing(game_data, max_plies=200):
    
    """
    Prepare the game data by creating move sequences from non-null moves.
    
    Args:
    - game_data (pd.DataFrame): The DataFrame with chess moves.
    - max_plies (int): The maximum number of plies to consider (default is 200).
    
    Returns:
    - pd.DataFrame: A DataFrame with a new column 'move_sequence' containing sequences of moves.
    """
    
    move_columns = [f'Move_ply_{i+1}' for i in range(max_plies)]
    selected_moves_df = game_data[move_columns].copy()

    def create_move_sequence(row):
        moves = row.dropna().tolist()  # Drop NaN values and convert to list
        return ' '.join(moves)

    selected_moves_df['move_sequence'] = selected_moves_df.apply(create_move_sequence, axis=1)
    selected_moves_df['mapped_opening'] = game_data['mapped_opening']  
    return selected_moves_df

# Prepare the incomplete game data for testing
# For simplicity, let's use the first modified dataset for further processing
processed_train_data_sample = prepare_game_data_for_testing(train_datasets)
processed_test_data_sample = prepare_game_data_for_testing(test_datasets)

In [15]:
processed_test_data_sample.head()

Unnamed: 0,Move_ply_1,Move_ply_2,Move_ply_3,Move_ply_4,Move_ply_5,Move_ply_6,Move_ply_7,Move_ply_8,Move_ply_9,Move_ply_10,...,Move_ply_193,Move_ply_194,Move_ply_195,Move_ply_196,Move_ply_197,Move_ply_198,Move_ply_199,Move_ply_200,move_sequence,mapped_opening
76403,e4,e5,Nf3,Nc6,Bb5,d6,Bxc6+,bxc6,d4,exd4,...,,,,,,,,,e4 e5 Nf3 Nc6 Bb5 d6 Bxc6+ bxc6 d4 exd4 Nxd4 N...,Ruy Lopez: Steinitz Defense
48651,c4,e6,Nc3,d5,d4,dxc4,e4,Nf6,Bxc4,c5,...,,,,,,,,,c4 e6 Nc3 d5 d4 dxc4 e4 Nf6 Bxc4 c5 d5 exd5 ex...,English Opening: Agincourt Defense
100696,e4,e6,d4,d5,Nc3,Bb4,Bd3,Nf6,e5,Nfd7,...,,,,,,,,,e4 e6 d4 d5 Nc3 Bb4 Bd3 Nf6 e5 Nfd7 Qg4 O-O Bh...,French Defense: Winawer Variation
2971,d4,Nf6,c4,e6,Nf3,d5,cxd5,exd5,Bg5,c6,...,,,,,,,,,d4 Nf6 c4 e6 Nf3 d5 cxd5 exd5 Bg5 c6 e3 Bd6 Nc...,Indian Defense: Anti-Nimzo-Indian
110570,e4,e6,Nf3,b6,d4,Bb7,Bd3,f5,Qe2,Nf6,...,,,,,,,,,e4 e6 Nf3 b6 d4 Bb7 Bd3 f5 Qe2 Nf6 exf5 Bd6 fx...,French Defense: Knight Variation


# Step 8: Test Pattern Recognition on Incomplete Data

In this step, the ChessOpeningMapper is used to predict chess openings from the processed, incomplete data. The predicted openings are then compared with the actual mapped openings to determine the accuracy of the model, assessing its ability to handle missing data.

In [16]:
# Test pattern recognition on incomplete data
def test_pattern_recognition_on_incomplete_data(mapper, incomplete_df, dataset_label):
    
    """
    Test the pattern recognition on incomplete data using ChessOpeningMapper.
    
    Args:
    - mapper (ChessOpeningMapper): The ChessOpeningMapper instance for mapping openings.
    - incomplete_df (pd.DataFrame): The DataFrame containing incomplete chess move data.
    - dataset_label (str): Label to indicate which dataset is being tested (e.g., 'Training' or 'Testing').
    
    Returns:
    - pd.DataFrame: A DataFrame with the original and predicted openings for comparison.
    """
    
    # Use the mapper to predict openings based on incomplete data
    incomplete_results = mapper.get_opening_name_from_game(incomplete_df)
    incomplete_df['predicted_opening'] = incomplete_results['opening_name']
    
    if 'mapped_opening' in incomplete_df.columns:
        incomplete_df['match'] = incomplete_df['mapped_opening'] == incomplete_df['predicted_opening']
        accuracy = incomplete_df['match'].mean()
        print(f'Accuracy on incomplete {dataset_label} data: {accuracy:.2%}')
    else:
        print(f"No 'mapped_opening' column found in input {dataset_label} data")
    
    return incomplete_df

# Instantiate ChessOpeningMapper and load trie
mapper = ChessOpeningMapper()
opening_moves = mapper.load_openings()
mapper.create_trie(opening_moves)

# Run the test on both training and testing data
train_result_df_test = test_pattern_recognition_on_incomplete_data(mapper, processed_train_data_sample, "training")
test_result_df_test = test_pattern_recognition_on_incomplete_data(mapper, processed_test_data_sample, "testing")


# Print results for training data
#print("\nTraining Data Results:")
#print("-" * 50)
#if 'mapped_opening' in train_result_df.columns and 'match' in train_result_df.columns:
 #   print(train_result_df[['mapped_opening', 'predicted_opening', 'match']].head())
#else:
#    print("Mapped or Match columns not found in the train result DataFrame")

Accuracy on incomplete training data: 96.10%
Accuracy on incomplete testing data: 95.99%


In [18]:
# # Test pattern recognition on all losses for both train and test datasets

### Note this will take a long time as it calculates the lost for 1% of data across 50 plies, suggest just running for 10 minutes
### or so just to get an idea that the ouput is working.
results = {}
for loss_key in train_loss_dict.keys():
    

    # Unpack the tuple returned by generate_random_data_removal
    modified_train_data, _ = generate_random_data_removal(train_datasets, 1)  # Assuming 1 is the loss percentage
    modified_test_data, _ = generate_random_data_removal(test_datasets, 1)

    # Proceed with data processing
    processed_train_data = prepare_game_data_for_testing(modified_train_data)
    processed_test_data = prepare_game_data_for_testing(modified_test_data)

    # Pattern recognition and results storage
    train_result_df = test_pattern_recognition_on_incomplete_data(mapper, processed_train_data, f"training ({loss_key})")
    test_result_df = test_pattern_recognition_on_incomplete_data(mapper, processed_test_data, f"testing ({loss_key})")

    results[loss_key] = {
        'train': train_result_df,
        'test': test_result_df
     }

Accuracy on incomplete training (50) data: 92.32%
Accuracy on incomplete testing (50) data: 92.27%


KeyboardInterrupt: 