
# Introduction to Audio Exploratory Data Analysis
## Part 2: Duplicate Hunting

<b> By Daniel Gladman, December 2022 </b>

Now that we have some data and some metadata, we should be good to go creating a model, right? Well, not so fast.

The data that I scraped seems pretty good at face value, and after picking a few files at random and listening to them, they seemed to be OK. But there are nearly 900 files in the data, and I don't really want to sit around listening to animal noises over and over, even if the durations are quite short.

So, how can I know that the data is trustworthy if I am unwilling to sit and validate each audio file? Well, I cannot. Not completely anyway. But there are a few things that I can do to check the integrity of the data. First off, I can analyse the audio data and use Python to tell me whether any files are duplicated in the data.

In [1]:
#Import the needed libraries

import IPython.display as ipd
import librosa
import numpy as np
import pandas as pd


### Demonstrating how the algorithm works.

First off, the solution I am providing is extremely poor in terms of readability and code hygiene. It definitely needs refactoring. However, I will break down and explain my algorithm so that you may come up with a better and more optimal solution.

First, we will read in the metadata file. I decided to drop any files with a duration of "0": even though their true duration is unlikely to be exactly zero, they are probably too short to be useful anyway.



In [2]:
df = pd.read_csv("./Data/metadata.csv") # You may need to adjust the path to find the Data folder

# Drop zero files
indexZero = df[ df['seconds'] == 0 ].index
df.drop(indexZero , inplace=True)

df.head()


,Unnamed: 0,filename,seconds,class
0,0,bird_1.wav,1,bird
1,1,bird_10.wav,1,bird
2,2,bird_100.wav,3,bird
3,3,bird_101.wav,1,bird
4,4,bird_102.wav,3,bird


Next we will create some groups based on the individual classes and durations. The reason for doing so is that ideally we would like to break this dataframe down into smaller dataframes before we start reading in the audio.

I suppose I could have simply just loaded in all the audio to begin with and then created subgroups for comparison, but doing so perhaps requires more memory to store the audio data. With this size of a dataset it probably doesn't matter, but if we were dealing with millions of files, it would definitely matter.


In [3]:
g = df.groupby('class')['seconds'].apply(lambda x: np.unique(x))
classes = g.index
uni_durations = g.values

sub_dfs = []
for cls, dur in zip(classes, uni_durations):
    for d in dur:
        sub_df = df.loc[(df["class"] == cls) & (df["seconds"] == d)]
        sub_dfs.append(sub_df)


print(f'There are {len(sub_dfs)} dataframes.')
print()
print(sub_dfs[0].head(3))
print()
print(sub_dfs[1].head(3))
print()
print(sub_dfs[2].head(3))
print()
print(sub_dfs[91].head(3))

There are 92 dataframes.

   Unnamed: 0      filename  seconds class
0           0    bird_1.wav        1  bird
1           1   bird_10.wav        1  bird
3           3  bird_101.wav        1  bird

    Unnamed: 0      filename  seconds class
11          11  bird_109.wav        2  bird
13          13  bird_110.wav        2  bird
17          17  bird_114.wav        2  bird

    Unnamed: 0      filename  seconds class
2            2  bird_100.wav        3  bird
4            4  bird_102.wav        3  bird
12          12   bird_11.wav        3  bird

     Unnamed: 0      filename  seconds  class
850         850  sheep_23.wav        8  sheep
857         857   sheep_3.wav        8  sheep
862         862  sheep_34.wav        8  sheep
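
As an aside, the same split can probably be written more compactly by grouping on both columns at once. The sketch below is only an illustration on made-up toy metadata (the `toy` frame and its values are assumptions, not the real metadata.csv), and it copies each group so that later edits do not touch the original dataframe.

```python
import pandas as pd

# Hypothetical toy metadata standing in for metadata.csv
toy = pd.DataFrame({
    "filename": ["bird_1.wav", "bird_2.wav", "bird_3.wav", "cat_1.wav", "cat_2.wav"],
    "seconds": [1, 1, 3, 1, 2],
    "class": ["bird", "bird", "bird", "cat", "cat"],
})

# One sub-dataframe per (class, seconds) combination.
# groupby sorts the keys, so the order matches the nested loops above.
toy_sub_dfs = [group.copy() for _, group in toy.groupby(["class", "seconds"])]

print(f"There are {len(toy_sub_dfs)} dataframes.")
print(toy_sub_dfs[0])
```

Each element of `toy_sub_dfs` plays the same role as an element of `sub_dfs` above.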


Now that we have a list of dataframes broken up by class and duration, I will demonstrate the algorithm on one of the dataframes first before building up to handle all of the dataframes at once.

First I will load in the audio data and insert it into a new column. I will also remove the leftover "Unnamed: 0" index column as it makes the dataframe look messy.

In [4]:
# Work on an explicit copy so pandas does not raise a SettingWithCopyWarning
test_df = sub_dfs[0].copy()
path = "./Data/"  # You may need to adjust the path to find the Data folder
audios, srs = [], []
for fn, cls in zip(test_df["filename"], test_df["class"]):
    fp = f"{path}{cls}/{fn}"
    # librosa returns the samples as a NumPy array plus the sample rate
    audio, sr = librosa.load(fp, res_type="kaiser_best")
    audios.append(audio)
    srs.append(sr)
test_df["audio"] = audios

test_df.drop("Unnamed: 0", axis=1, inplace=True)

test_df.head()


,filename,seconds,class,audio
0,bird_1.wav,1,bird,"[0.014942548, 0.013527845, -0.0060404763, -0.0..."
1,bird_10.wav,1,bird,"[-0.010408278, -0.022793703, -0.018339803, -0...."
3,bird_101.wav,1,bird,"[-0.0056431275, -0.014416743, -0.024429562, -0..."
5,bird_103.wav,1,bird,"[-0.009546474, -0.008485821, 0.002183585, 0.00..."
6,bird_104.wav,1,bird,"[-0.004036505, 0.010824322, 0.0142896315, -0.0..."


Next I will create a copy of the dataframe just so I can go back and ensure that anything I do doesn't accidentally impact the original dataframe.

First I will reset the indices. This is important because I want to create a reference point that I can look up later when I want to pair off any duplicate files.

Then I will use a two-pointer algorithm to check each audio NumPy array against all the others in the dataframe. The rule is simple: I subtract one array from the other and sum the result. If the sum is 0, I treat the two files as containing exactly the same audio and mark the pair "Y" for duplicate; otherwise I mark it "N" for not duplicate. Arrays of different lengths cannot be subtracted, so those pairs are marked "N" as well.

In retrospect, this isn't the best solution, as there could be fringe cases where the differences cancel out and sum to zero for two distinct audio clips, but the probability is quite low. In fact there is a better solution, but I will leave it to you to figure it out.
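
If you would rather see one possible answer, here is a minimal sketch. It is not the method used in the rest of this notebook: it swaps the sum-of-differences test for NumPy's element-wise `np.array_equal`, and the arrays `x` and `y` and the helper `is_exact_duplicate` are made up for illustration.

```python
import numpy as np

def is_exact_duplicate(a, b):
    """Return True only if both arrays have the same shape and the same values."""
    # np.array_equal returns False for mismatched shapes instead of raising,
    # so no try/except is needed around the comparison.
    return bool(np.array_equal(a, b))

# Two distinct toy signals whose differences cancel out: x - y = [1, -1, 0]
x = np.array([0.5, -0.5, 0.25], dtype=np.float32)
y = np.array([-0.5, 0.5, 0.25], dtype=np.float32)

sum_check = bool(np.sum(x - y) == 0)   # the sum-based rule calls these duplicates
exact_check = is_exact_duplicate(x, y)  # the element-wise rule does not

print(f"sum-based check: {sum_check}, element-wise check: {exact_check}")
```

On real recordings such a cancellation is unlikely, but the element-wise check removes the possibility altogether.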

Finally, I will collect the "Y" and "N" labels for each file in a list. My goal is to add each of these lists as a new column for further analysis. But I need to make sure each column has the same length as the dataframe, otherwise my labels could point to a different file than intended. This is why I take the number of audio files, subtract the length of the current list, and use the difference to generate null values that I prepend to the list. This way each list always has the same length. And finally, to create meaningful column names that I can reference later when looking up duplicates, I name each column after the row position of the file that the list was generated from.

That's a little difficult to explain, so maybe it is easier to look at it.

In [5]:
test_copy = test_df.copy()  # An actual copy, so the original dataframe stays untouched
test_copy = test_copy.reset_index()
test_copy.rename(columns = {'index':'original_index'}, inplace = True)
a = test_copy['audio']
n =  len(a)
l, r = 0, 1

while(r < n):
    x = a.iloc[l]
    y = a.iloc[r:]
    dup= []
    for i in y:
        try:
            if np.sum(x - i) == 0:
                dup.append("Y")
            else:
                dup.append("N")
        except(ValueError):
            dup.append("N")
    
    n_nas = n - len(dup)
    l_nas = [np.nan] * n_nas
    dup = l_nas+dup
    test_copy[f"dup_{l}"] = dup
    l += 1
    r += 1

test_copy.head()


,original_index,filename,seconds,class,audio,dup_0,dup_1,dup_2,dup_3,dup_4,...,dup_76,dup_77,dup_78,dup_79,dup_80,dup_81,dup_82,dup_83,dup_84,dup_85
0,0,bird_1.wav,1,bird,"[0.014942548, 0.013527845, -0.0060404763, -0.0...",,,,,,...,,,,,,,,,,
1,1,bird_10.wav,1,bird,"[-0.010408278, -0.022793703, -0.018339803, -0....",N,,,,,...,,,,,,,,,,
2,3,bird_101.wav,1,bird,"[-0.0056431275, -0.014416743, -0.024429562, -0...",N,N,,,,...,,,,,,,,,,
3,5,bird_103.wav,1,bird,"[-0.009546474, -0.008485821, 0.002183585, 0.00...",N,N,N,,,...,,,,,,,,,,
4,6,bird_104.wav,1,bird,"[-0.004036505, 0.010824322, 0.0142896315, -0.0...",N,N,N,N,,...,,,,,,,,,,


Here dup_0 refers to the first file, bird_1.wav. Of course there is no point checking a file against itself, so the first value is NaN by default. The two-pointer algorithm helps me achieve this easily, but it also has another advantage that I will utilize later: each subsequent value in the dup_0 column sits in the row of the file it was checked against. Likewise, dup_1 refers to bird_10.wav. Its first two entries are NaN because dup_0 already covered the check against bird_1.wav (doing so again would provide redundant information), and all subsequent values are the checks between bird_10.wav and all other files in the dataframe.
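
To make the layout easier to see, here is a small sketch that rebuilds the same kind of dup_x columns for four made-up files. The file names and sample values are assumptions chosen so that a.wav and c.wav are identical; the comparison uses the same sum-of-differences rule as above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a.wav and c.wav hold identical samples
filenames = ["a.wav", "b.wav", "c.wav", "d.wav"]
audios = [
    np.array([1.0, 2.0]),
    np.array([3.0, 4.0]),
    np.array([1.0, 2.0]),
    np.array([5.0, 6.0]),
]

toy = pd.DataFrame({"filename": filenames})
n = len(audios)

for left in range(n - 1):
    x = audios[left]
    # Compare file `left` only with the files that come after it
    labels = ["Y" if np.sum(x - other) == 0 else "N" for other in audios[left + 1:]]
    # Pad the front with NaN so each label lines up with the row it describes
    toy[f"dup_{left}"] = [np.nan] * (n - len(labels)) + labels

print(toy)
```

The only "Y" lands in column dup_0 on the row for c.wav, which reads as "c.wav is a duplicate of the file in row 0".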



But what can we do with this information? There is now a long list of columns and it isn't easy to spot where the duplicates are, if any!

So what do we do? 

We need to filter out all the columns and rows that don't have "Y". Doing this isn't super straightforward and will require some massaging. First I will create a copy of this new df. The copy may be redundant, but I will do so purely for testing purposes. I will need this dataframe so I can refer back to it later when I want to pair a "dup_x" with its corresponding file. You will see why soon.

Next I will create another copy of the dataframe, and to this one I will apply an 'isin' filter to return only the rows that have a "Y" in any of the columns.

In [6]:
sub_df = test_copy.reset_index()
sub_df.rename(columns = {'index':'dup_index'}, inplace = True)
dup_df = test_copy[test_copy.isin(['Y']).any(axis=1)]
dup_df = dup_df.reset_index()
dup_df.rename(columns = {'index':'dup_index'}, inplace = True)
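
The row filter used above can be seen in isolation on a tiny frame. This is only a sketch with made-up values (the `flags` frame is an assumption): `isin` builds a boolean frame of the same shape, and `any(axis=1)` collapses it to one boolean per row.

```python
import numpy as np
import pandas as pd

# Hypothetical flags for four files
flags = pd.DataFrame({
    "filename": ["a.wav", "b.wav", "c.wav", "d.wav"],
    "dup_0": [np.nan, "N", "Y", "N"],
    "dup_1": [np.nan, np.nan, "N", "N"],
})

# True for every row that contains "Y" in at least one column
has_dup = flags.isin(["Y"]).any(axis=1)
only_dups = flags[has_dup]

print(only_dups)
```

Note that the test looks at every column, so it would also match a non-flag column whose value happened to be exactly "Y".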



In [7]:
sub_df.head(3)

,dup_index,original_index,filename,seconds,class,audio,dup_0,dup_1,dup_2,dup_3,...,dup_76,dup_77,dup_78,dup_79,dup_80,dup_81,dup_82,dup_83,dup_84,dup_85
0,0,0,bird_1.wav,1,bird,"[0.014942548, 0.013527845, -0.0060404763, -0.0...",,,,,...,,,,,,,,,,
1,1,1,bird_10.wav,1,bird,"[-0.010408278, -0.022793703, -0.018339803, -0....",N,,,,...,,,,,,,,,,
2,2,3,bird_101.wav,1,bird,"[-0.0056431275, -0.014416743, -0.024429562, -0...",N,N,,,...,,,,,,,,,,


In [8]:
dup_df

,dup_index,original_index,filename,seconds,class,audio,dup_0,dup_1,dup_2,dup_3,...,dup_76,dup_77,dup_78,dup_79,dup_80,dup_81,dup_82,dup_83,dup_84,dup_85
0,9,36,bird_131.wav,1,bird,"[-0.035403516, -0.009399489, 0.07566398, 0.116...",N,N,N,N,...,,,,,,,,,,
1,10,37,bird_132.wav,1,bird,"[0.008861561, 0.0091502, -0.0032833791, -0.009...",N,N,N,N,...,,,,,,,,,,
2,19,78,bird_17.wav,1,bird,"[0.0016596618, 0.0025352237, -0.00058823754, -...",N,N,N,N,...,,,,,,,,,,
3,21,89,bird_18.wav,1,bird,"[0.00023781689, 0.0006564895, 0.0006253673, -9...",N,N,N,N,...,,,,,,,,,,
4,24,100,bird_19.wav,1,bird,"[0.00041163398, -0.0009818975, -0.0019779226, ...",N,N,N,N,...,,,,,,,,,,
5,29,112,bird_20.wav,1,bird,"[0.0059306254, 0.0038046334, -0.0054974956, -0...",N,N,N,N,...,,,,,,,,,,


This is a bit better but I still cannot see where the "Y" are in these 6 rows. 

What I need to do is find a way to iterate over the columns of each row and extract the column name. But there are some columns that I don't need to iterate over, e.g.  filename and class...

So what do I do?

My solution is to create a custom list of the dup_x columns to use to evaluate each row series of data. I will extract the information I need from each series and then rebuild the dataframe back up with the information I need. This is the best solution because later when I need to automate this over many different dataframes with different numbers of files, I will be able to extract the correct dup_x values no matter what dataframe the algorithm reads.

In [9]:
dup_cols = []
for i in range(l):
    dup_col_name = f"dup_{i}"
    dup_cols.append(dup_col_name)

dups = []
for dup_col in dup_cols:
    s = dup_df.loc[(dup_df[dup_col] == "Y")]
    s = s.loc[:, ['dup_index', 'filename', 'seconds', 'class', dup_col]]
    if len(s) > 0:
        dup_pair = (s['dup_index'].values[0], s[dup_col].name)
        dups.append(dup_pair)
        # print(s['dup_index'].values[0], s[dup_col].name)


dups

[(9, 'dup_7'),
 (10, 'dup_8'),
 (19, 'dup_11'),
 (21, 'dup_12'),
 (24, 'dup_13'),
 (29, 'dup_14')]

The above section of code extracts the index number of each row which contains a duplicated file, along with the dup_x column that the duplicated file corresponds to.

Now we need to put this back together by referencing it back to the original file.

First I will clean up the dataframes by removing all the dup_x values. I will then iterate over each duplicate pair, find the filename for the dup_x and append it to a new list. I've called it p for pairs.

In [10]:
sub_df = sub_df.loc[:, ['dup_index', 'original_index', 'filename', 'seconds', 'class']]
dup_df = dup_df.loc[:, ['dup_index', 'original_index', 'filename', 'seconds', 'class']]

ps = []
for d in dups:  

    di = d[0]
    di_2 = d[1].partition('_')[2]

    p = dup_df.loc[(dup_df['dup_index'] == di)]
    p2= sub_df.loc[(sub_df['dup_index'] == int(di_2))]
    p['dup_filename'] = p2.filename.values[0]
    ps.append(p.values)
  
ps

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  p['dup_filename'] = p2.filename.values[0]

[array([[9, 36, 'bird_131.wav', 1, 'bird', 'bird_12.wav']], dtype=object),
 array([[10, 37, 'bird_132.wav', 1, 'bird', 'bird_13.wav']], dtype=object),
 array([[19, 78, 'bird_17.wav', 1, 'bird', 'bird_136.wav']], dtype=object),
 array([[21, 89, 'bird_18.wav', 1, 'bird', 'bird_137.wav']], dtype=object),
 array([[24, 100, 'bird_19.wav', 1, 'bird', 'bird_138.wav']], dtype=object),
 array([[29, 112, 'bird_20.wav', 1, 'bird', 'bird_139.wav']], dtype=object)]
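
The name-to-row lookup in the loop above hinges on `str.partition`. Here is a minimal sketch of that single step, using the first pair from the `dups` output; the variable names are made up for illustration.

```python
# First pair from the output above: row 9 was flagged in column dup_7
dup_pair = (9, "dup_7")
flagged_row, col_name = dup_pair

# partition splits on the first "_" and returns (before, separator, after)
prefix, separator, suffix = col_name.partition("_")
checked_against = int(suffix)

print(f"Row {flagged_row} duplicates the file in row {checked_against}")
```

Row 7 of sub_df is bird_12.wav, which is why bird_131.wav (row 9) is paired with it.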

Finally I will reconstruct this information into a new dataframe that I can easily join with other dataframes when I eventually run this algorithm over the 92 dataframes I segmented the original metadata file into.

In [11]:
new_rows = []
for p in ps:
    new_row = list(p[0])
    new_rows.append(new_row)

new_df = pd.DataFrame(new_rows, columns=['dup_index', 'original_index', 'filename', 'seconds', 'class', 'duplicate_filename'])
new_df
    

,dup_index,original_index,filename,seconds,class,duplicate_filename
0,9,36,bird_131.wav,1,bird,bird_12.wav
1,10,37,bird_132.wav,1,bird,bird_13.wav
2,19,78,bird_17.wav,1,bird,bird_136.wav
3,21,89,bird_18.wav,1,bird,bird_137.wav
4,24,100,bird_19.wav,1,bird,bird_138.wav
5,29,112,bird_20.wav,1,bird,bird_139.wav


Here is a quick sanity check to show that the above code worked as intended. See how the two copied dataframes can be referenced. NOTE the index numbers.

In [12]:
dup_df

,dup_index,original_index,filename,seconds,class
0,9,36,bird_131.wav,1,bird
1,10,37,bird_132.wav,1,bird
2,19,78,bird_17.wav,1,bird
3,21,89,bird_18.wav,1,bird
4,24,100,bird_19.wav,1,bird
5,29,112,bird_20.wav,1,bird


In [13]:
sub_df.head(10)

,dup_index,original_index,filename,seconds,class
0,0,0,bird_1.wav,1,bird
1,1,1,bird_10.wav,1,bird
2,2,3,bird_101.wav,1,bird
3,3,5,bird_103.wav,1,bird
4,4,6,bird_104.wav,1,bird
5,5,7,bird_105.wav,1,bird
6,6,10,bird_108.wav,1,bird
7,7,23,bird_12.wav,1,bird
8,8,34,bird_13.wav,1,bird
9,9,36,bird_131.wav,1,bird


In [14]:
sub_df.loc[(sub_df["dup_index"] == 9)]

,dup_index,original_index,filename,seconds,class
9,9,36,bird_131.wav,1,bird


In [15]:
dup_df.loc[(dup_df["dup_index"] == 9)]

,dup_index,original_index,filename,seconds,class
0,9,36,bird_131.wav,1,bird


## Putting it all together.

Here is the final algorithm that will iterate over all segments, find the duplicates in each segment, then bring it all back together into one single dataframe which will be saved as duplicates.csv.

In [16]:
# Read the metadata file
df = pd.read_csv("./Data/metadata.csv") # You may need to adjust the path to find the Data folder

# Remove entries with duration 0 seconds, these files are likely to be too short
indexZero = df[ df['seconds'] == 0 ].index
df.drop(indexZero , inplace=True)

# Group the df by class and seconds. 
g = df.groupby('class')['seconds'].apply(lambda x: np.unique(x))
classes = g.index
uni_durations = g.values

# Split the df into sub dataframes based on class and duration
sub_dfs = []
for cls, dur in zip(classes, uni_durations):
        for d in dur:
            sub_df = df.loc[(df["class"] == cls) & (df["seconds"] == d)]
            sub_dfs.append(sub_df)

# Extract audio features for each file in each sub_df
grped_dfs = []
for sub_df in sub_dfs:
        grp_df = sub_df
        audios, srs = [], []
        for fn, cls in zip(grp_df["filename"], grp_df['class']):
                path = "./Data/"  # You may need to adjust the path to find the Data folder
                fp = f"{path}{cls}/{fn}"

                try:
                        audio, sr = librosa.load(fp, res_type="kaiser_best")
                        audios.append(audio)
                        srs.append(sr)
                except(ValueError):
                        audios.append(np.nan)
                        srs.append(np.nan)

        grp_df['audio'] = audios
        grp_df.drop('Unnamed: 0', axis=1, inplace=True)
        grped_dfs.append(grp_df)


find_dup_dfs = []
for grped_df in grped_dfs:

    find_dup_df = grped_df
    # Reset index because we will want to be able to reference the original index 
    find_dup_df = find_dup_df.reset_index()
    find_dup_df.rename(columns = {'index':'original_index'}, inplace = True)

    # Algorithm to find duplicates
    a = find_dup_df['audio']
    n =  len(a)
    l, r = 0, 1

    while(r < n):
        current_row = a.iloc[l]
        subseq_rows = a.iloc[r:]
        dup= []
        for i in subseq_rows:
            try:
                if np.sum(current_row - i) == 0:
                    dup.append("Y")
                else:
                    dup.append("N")
            except(ValueError):
                dup.append("N")
        
        n_nas = n - len(dup)
        l_nas = [np.nan] * n_nas
        dup = l_nas+dup
        find_dup_df[f"dup_{l}"] = dup
        l += 1
        r += 1

    #Create a copy and reset index to reference duplicates later
    find_dup_copy = find_dup_df.reset_index()
    find_dup_copy.rename(columns = {'index':'dup_index'}, inplace = True)

    #Filter each df to show only "Y" duplicates
    filtered_dup_df = find_dup_df[find_dup_df.isin(['Y']).any(axis=1)]
    filtered_dup_df = filtered_dup_df.reset_index()
    filtered_dup_df.rename(columns = {'index':'dup_index'}, inplace = True)
    
    dup_cols = []
    for i in range(l): # Reuse pointer from before
        dup_col_name = f"dup_{i}"
        dup_cols.append(dup_col_name)

    dups = []
    for dup_col in dup_cols:
        s = filtered_dup_df.loc[(filtered_dup_df[dup_col] == "Y")]
        s = s.loc[:, ['dup_index', 'filename', 'seconds', 'class', dup_col]]
        if len(s) > 0:
            dup_pair = (s['dup_index'].values[0], s[dup_col].name)
            dups.append(dup_pair)
            # print(s['dup_index'].values[0], s[dup_col].name)

    find_dup_copy = find_dup_copy.loc[:, ['dup_index', 'original_index', 'filename', 'seconds', 'class']]
    filtered_dup_df = filtered_dup_df.loc[:, ['dup_index', 'original_index', 'filename', 'seconds', 'class']]
    
    if len(filtered_dup_df) > 0:
        ps = []
        for d in dups:  

            di = d[0]
            di_2 = d[1].partition('_')[2]

            p = filtered_dup_df.loc[(filtered_dup_df['dup_index'] == di)]
            p2 = find_dup_copy.loc[(find_dup_copy['dup_index'] == int(di_2))]
            p['dup_filename'] = p2.filename.values[0]
            ps.append(p.values)
        
        new_rows = []
        for p in ps:
            new_row = list(p[0])
            new_rows.append(new_row)

        new_df = pd.DataFrame(new_rows, columns=['dup_index', 'original_index', 'filename', 'seconds', 'class', 'duplicate_filename'])
        find_dup_dfs.append(new_df)

final_dups = pd.concat(find_dup_dfs) 
final_dups = final_dups.reset_index()
final_dups.drop("index", axis=1, inplace=True)     
final_dups.to_csv(f'{path}/duplicates.csv')              
            

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  grp_df['audio'] = audios
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  grp_df.drop('Unnamed: 0', axis=1, inplace=True)

And we're done. Let's check the final dataframe.

In [17]:

final_dups


,dup_index,original_index,filename,seconds,class,duplicate_filename
0,9,36,bird_131.wav,1,bird,bird_12.wav
1,10,37,bird_132.wav,1,bird,bird_13.wav
2,19,78,bird_17.wav,1,bird,bird_136.wav
3,21,89,bird_18.wav,1,bird,bird_137.wav
4,24,100,bird_19.wav,1,bird,bird_138.wav
5,29,112,bird_20.wav,1,bird,bird_139.wav
6,14,56,bird_15.wav,2,bird,bird_134.wav
7,16,67,bird_16.wav,2,bird,bird_135.wav
8,3,45,bird_14.wav,5,bird,bird_133.wav
9,3,213,cat_110.wav,1,cat,cat_104.wav


Let's check a couple.

In [18]:
example_file = './Data/bird/bird_131.wav'
ipd.Audio(example_file)

In [19]:
example_file = './Data/bird/bird_12.wav'
ipd.Audio(example_file)

Sounds identical to me. Let's try another few.

In [20]:
example_file = './Data/chicken/chicken_28.wav'
ipd.Audio(example_file)

In [21]:
example_file = './Data/chicken/chicken_16.wav'
ipd.Audio(example_file)

In [22]:
example_file = './Data/cow/cow_68.wav'
ipd.Audio(example_file)

In [23]:
example_file = './Data/cow/cow_54.wav'
ipd.Audio(example_file)

There you have it. It's good that we found a way to quickly check for duplicates, as having these in the training data can give misleading results, especially if one file of a duplicate pair ends up in training and the other in testing. Not good.
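
As a rough preview of that clean-up, the sketch below drops one file from each duplicate pair before any train/test split. The two small frames are made up (only the pair names are borrowed from the table above), and keeping the file listed under duplicate_filename is an arbitrary choice.

```python
import pandas as pd

# Hypothetical slice of the metadata
metadata = pd.DataFrame({
    "filename": ["bird_12.wav", "bird_131.wav", "bird_13.wav", "bird_132.wav", "bird_1.wav"],
    "class": ["bird"] * 5,
})

# Hypothetical slice of duplicates.csv
duplicates = pd.DataFrame({
    "filename": ["bird_131.wav", "bird_132.wav"],
    "duplicate_filename": ["bird_12.wav", "bird_13.wav"],
})

# Keep the file named in duplicate_filename and drop the one in filename
cleaned = metadata[~metadata["filename"].isin(duplicates["filename"])]
cleaned = cleaned.reset_index(drop=True)

print(cleaned)
```

With one file per pair removed, a later split can no longer place the same audio on both sides.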

We'll leave it here, but next time we'll clean the data and remove these files then train a very simple neural network.