Python: Keep Unique Genotypes
Sean Beagle edited this page Jun 10, 2020 · 12 revisions
import pandas as pd
file_in = 'MAV_H87_matrix806.cfonly.nodupdist.20200608.csv'
df = pd.read_csv(file_in)
print(f'{len(df)} records in DataFrame')
6990 records in DataFrame
- condition1: Both samples are from the same patient.
- condition2: Distance between samples is 20 or less.
Then sort the DataFrame in ascending order by distance.
condition1 = df['patient1'] == df['patient2']
condition2 = df['Dist'] <= 20
duplicates = df[condition1 & condition2].sort_values(by=['Dist'])
- IF one isolate is in keep THEN drop the other
- ELSE IF one isolate is in drop THEN also drop the other
- ELSE neither isolate is in keep or drop SO keep the first and drop the other
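keep = set()
drop = set()
# Rows are visited closest pair first because duplicates is sorted by Dist.
for i, row in duplicates.iterrows():
    # One isolate is already kept: drop its partner.
    if row['Species1'] in keep:
        drop.add(row['Species2'])
    elif row['Species2'] in keep:
        drop.add(row['Species1'])
    # One isolate is already dropped: drop its partner too.
    elif row['Species1'] in drop:
        drop.add(row['Species2'])
    elif row['Species2'] in drop:
        drop.add(row['Species1'])
    # Neither isolate has been seen: keep the first, drop the second.
    else:
        keep.add(row['Species1'])
        drop.add(row['Species2'])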
# Series.append was removed in pandas 2.0; pd.concat gives the same result.
unique_isolates = pd.concat([duplicates['Species1'], duplicates['Species2']]).unique()
print(f"Found {len(unique_isolates)} unique isolates")
print(f'Keeping {len(keep)} isolates')
print(f'Dropping {len(drop)} isolates')
Found 74 unique isolates
Keeping 30 isolates
Dropping 44 isolates
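The keep/drop rule can be easier to follow on a handful of made-up pairs. The sketch below restates the rule for illustration and is not the page's own code: `split_keep_drop` is a hypothetical helper and the isolate names A to E are invented. It also checks whether any isolate lands in both sets, which the rule permits when two isolates that were each kept from earlier pairs later appear together.

```python
def split_keep_drop(pairs):
    """Apply the keep/drop rule to (isolate1, isolate2) pairs.

    `pairs` is assumed to be sorted by ascending distance already.
    """
    keep, drop = set(), set()
    for first, second in pairs:
        if first in keep:
            drop.add(second)
        elif second in keep:
            drop.add(first)
        elif first in drop:
            drop.add(second)
        elif second in drop:
            drop.add(first)
        else:
            keep.add(first)
            drop.add(second)
    return keep, drop


# Invented isolate names, closest pair first.
toy_pairs = [("A", "B"), ("B", "C"), ("D", "E"), ("A", "D")]
toy_keep, toy_drop = split_keep_drop(toy_pairs)
overlap = toy_keep & toy_drop

print(sorted(toy_keep))  # ['A', 'D']
print(sorted(toy_drop))  # ['B', 'C', 'D', 'E']
print(sorted(overlap))   # ['D']
```

In this toy run D sits in both sets, and a filter that removes every isolate in the drop set would still remove it. For the dataset above, 30 kept plus 44 dropped equals the 74 unique isolates, so no isolate ended up in both sets.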
- condition1: Species1 is not in drop set.
- condition2: Species2 is not in drop set.
condition1 = ~df['Species1'].isin(drop)
condition2 = ~df['Species2'].isin(drop)
df2 = df[condition1 & condition2]
print(f'{len(df2)} records in new DataFrame.')
4108 records in new DataFrame.
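A quick way to sanity-check this kind of filter is to confirm that no remaining row mentions a dropped isolate. The sketch below does that on a small made-up table: the column names follow the page, but the rows and the contents of `toy_drop` are invented for illustration.

```python
import pandas as pd

# Made-up pairwise records using the page's column names.
toy = pd.DataFrame({
    "Species1": ["A", "A", "B", "C"],
    "Species2": ["B", "C", "C", "D"],
    "Dist": [5, 40, 38, 12],
})
toy_drop = {"B"}  # assumed result of the keep/drop step

# Keep only rows where neither isolate is in the drop set.
mask = ~toy["Species1"].isin(toy_drop) & ~toy["Species2"].isin(toy_drop)
toy_filtered = toy[mask]

# No remaining row should mention a dropped isolate.
remaining = set(toy_filtered["Species1"]) | set(toy_filtered["Species2"])
assert remaining.isdisjoint(toy_drop)
print(len(toy_filtered))  # 2
```

The same `isdisjoint` check could be run against the real filtered DataFrame and drop set.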
file_out = file_in.replace('.csv', '.FILTERED.csv')
# Write the filtered DataFrame (df2), not the original df.
df2.to_csv(file_out)
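One caveat on the output name: `str.replace` rewrites every occurrence of `.csv` in the string, not only the extension. If input names might contain `.csv` elsewhere, a possible alternative is `pathlib.Path.with_suffix`, which swaps only the final suffix. This is a sketch of that alternative, not what the page uses, and `out_path` is a name introduced here.

```python
from pathlib import Path

file_in = "MAV_H87_matrix806.cfonly.nodupdist.20200608.csv"

# with_suffix replaces only the last suffix (".csv"); earlier dots are untouched.
out_path = Path(file_in).with_suffix(".FILTERED.csv")
print(out_path.name)
# MAV_H87_matrix806.cfonly.nodupdist.20200608.FILTERED.csv
```

For this particular file name both approaches give the same result.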