Kheshini Budhna (C0909662)

Purpose of processing_1:
- Find duplicated links in 'tgif-v1.0.tsv' and remove them
- Index the dataset for streamlined processing of frames

Code description: 
- Importing original tsv file 
- Renaming columns 
- Checking for duplicate links and descriptions
- Dumping duplicated links, each with its distinct descriptions, to the file 'duplicated_links_descriptions.tsv'
- Deleting every record whose link is duplicated from the main dataframe (no copy of a duplicated link is kept)
- Indexing the unique records in the main dataframe 
- Dumping all unique records & creating a second version of the dataset 'tgif-v2.0.tsv'
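
The steps above can be sketched on a toy table. This is only an illustration: the links and captions below are made up, and the variable names (toy, is_dup, duplicated_rows, unique_rows) are not part of the notebook. It shows that keep=False flags every copy of a repeated link, so no copy survives the filter.

```python
import pandas as pd

# Toy stand-in for the real TSV; the links and captions are made up.
toy = pd.DataFrame({
    "links": ["a.gif", "b.gif", "b.gif", "c.gif"],
    "description": ["a cat", "a dog runs", "a dog is running", "a bird"],
})

# keep=False flags every row whose link occurs more than once.
is_dup = toy.duplicated("links", keep=False)

duplicated_rows = toy[is_dup]                       # both "b.gif" rows
unique_rows = toy[~is_dup].reset_index(drop=True)   # "a.gif" and "c.gif"
unique_rows.index += 1                              # 1-based index, as in the notebook

print(duplicated_rows)
print(unique_rows)
```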

Requirements to run this file:
- Run on Google Colab (a local Jupyter notebook requires changing the Drive mount and the file path)
- version 1 of the dataset 'tgif-v1.0.tsv'
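
For a run outside Colab, one possible adaptation is to skip the Drive mount and read the file from a local path. The sketch below assumes the TSV sits next to the notebook; the helper name load_tgif and the path are hypothetical and should be adjusted.

```python
from pathlib import Path

import pandas as pd


def load_tgif(path):
    """Load a headerless two-column TSV and name its columns."""
    return pd.read_csv(
        Path(path), sep="\t", header=None, names=["links", "description"]
    )


# Hypothetical local path; adjust to wherever the file actually lives.
local_path = Path("tgif-v1.0.tsv")
if local_path.exists():
    df = load_tgif(local_path)
    print(df.shape)
```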

In [27]:
import pandas as pd
from google.colab import drive

# Step 1: Mount Google Drive
drive.mount('/content/drive', force_remount=True)

# Step 2: Load TSV file into a DataFrame without using the first line as the header
file_path = '/content/drive/My Drive/Colab_Notebooks/tgif-v1.0.tsv'  # Path to the dataset on Drive
df = pd.read_csv(file_path, sep='\t', header=None)  # Load TSV

# Display the first few rows of the DataFrame
df.head()


Mounted at /content/drive


Unnamed: 0,0,1
0,https://38.media.tumblr.com/9f6c25cc350f12aa74...,"a man is glaring, and someone with sunglasses ..."
1,https://38.media.tumblr.com/9ead028ef62004ef6a...,a cat tries to catch a mouse on a tablet
2,https://38.media.tumblr.com/9f43dc410be85b1159...,a man dressed in red is dancing.
3,https://38.media.tumblr.com/9f659499c8754e40cf...,an animal comes close to another in the jungle
4,https://38.media.tumblr.com/9ed1c99afa7d714118...,a man in a hat adjusts his tie and makes a wei...


In [28]:
# Rename the columns
df.columns = ['links', 'description']  # Column 0 holds the GIF link, column 1 its description

# Display the first few rows of the DataFrame
df.head()


Unnamed: 0,links,description
0,https://38.media.tumblr.com/9f6c25cc350f12aa74...,"a man is glaring, and someone with sunglasses ..."
1,https://38.media.tumblr.com/9ead028ef62004ef6a...,a cat tries to catch a mouse on a tablet
2,https://38.media.tumblr.com/9f43dc410be85b1159...,a man dressed in red is dancing.
3,https://38.media.tumblr.com/9f659499c8754e40cf...,an animal comes close to another in the jungle
4,https://38.media.tumblr.com/9ed1c99afa7d714118...,a man in a hat adjusts his tie and makes a wei...


In [29]:
# Check for duplicates in the "links" column
duplicates_links = df[df.duplicated('links', keep=False)]  # 'keep=False' to mark all duplicates
print("Duplicate links:")
print(duplicates_links[['links']])

Duplicate links:
                                                    links
48      https://38.media.tumblr.com/8cc0d8b94f1694a74f...
49      https://38.media.tumblr.com/8cc0d8b94f1694a74f...
59      https://38.media.tumblr.com/712368408b70ce32ee...
60      https://38.media.tumblr.com/712368408b70ce32ee...
64      https://38.media.tumblr.com/7043680fa969949ff4...
...                                                   ...
125777  https://38.media.tumblr.com/5c0633e677a97a023c...
125778  https://38.media.tumblr.com/402a02c59c7c47c300...
125779  https://38.media.tumblr.com/02fa66bd747ddbed58...
125780  https://38.media.tumblr.com/01e70784925ab9fe09...
125781  https://38.media.tumblr.com/51d2172ef413bc3e88...

[36047 rows x 1 columns]


In [30]:
duplicates_links[['links']]

Unnamed: 0,links
48,https://38.media.tumblr.com/8cc0d8b94f1694a74f...
49,https://38.media.tumblr.com/8cc0d8b94f1694a74f...
59,https://38.media.tumblr.com/712368408b70ce32ee...
60,https://38.media.tumblr.com/712368408b70ce32ee...
64,https://38.media.tumblr.com/7043680fa969949ff4...
...,...
125777,https://38.media.tumblr.com/5c0633e677a97a023c...
125778,https://38.media.tumblr.com/402a02c59c7c47c300...
125779,https://38.media.tumblr.com/02fa66bd747ddbed58...
125780,https://38.media.tumblr.com/01e70784925ab9fe09...


In [31]:
# Find duplicate links and their corresponding descriptions
duplicates_links = df[df.duplicated('links', keep=False)]

# Display the duplicate links along with their descriptions
print("Duplicate links and their matching descriptions:")
print(duplicates_links[['links', 'description']])

Duplicate links and their matching descriptions:
                                                    links  \
48      https://38.media.tumblr.com/8cc0d8b94f1694a74f...   
49      https://38.media.tumblr.com/8cc0d8b94f1694a74f...   
59      https://38.media.tumblr.com/712368408b70ce32ee...   
60      https://38.media.tumblr.com/712368408b70ce32ee...   
64      https://38.media.tumblr.com/7043680fa969949ff4...   
...                                                   ...   
125777  https://38.media.tumblr.com/5c0633e677a97a023c...   
125778  https://38.media.tumblr.com/402a02c59c7c47c300...   
125779  https://38.media.tumblr.com/02fa66bd747ddbed58...   
125780  https://38.media.tumblr.com/01e70784925ab9fe09...   
125781  https://38.media.tumblr.com/51d2172ef413bc3e88...   

                                              description  
48      a guy is making a few turns in a throne and th...  
49      a man is sitting in a room, he has a gun on hi...  
59      a woman is touching a colored 

In [32]:
# Step 1: Filter to find duplicated links
duplicate_links = df[df.duplicated('links', keep=False)]

# Step 2: Count unique duplicated links
unique_duplicate_links_count = duplicate_links['links'].nunique()

# Step 3: Display the count of unique duplicated links
print(f"Total number of unique duplicated links: {unique_duplicate_links_count}")

Total number of unique duplicated links: 12333


In [33]:
# Collect each duplicated link once
unique_duplicated_links = duplicate_links['links'].unique()

# Print the first three duplicated links with their descriptions
print("Duplicated Links and Descriptions:")
for link in unique_duplicated_links[:3]:  # Limit to the first 3 links
    matching_descriptions = duplicate_links[duplicate_links['links'] == link]['description'].tolist()

    print(f"Link: {link}")
    for i, description in enumerate(matching_descriptions):
        print(f"Description {i + 1}: {description}")
    print()  # Add a blank line for better readability

Duplicated Links and Descriptions:
Link: https://38.media.tumblr.com/8cc0d8b94f1694a74f8754de3934ae2d/tumblr_nqir72vsUx1sd7dido1_500.gif
Description 1: a guy is making a few turns in a throne and then points a gun.
Description 2: a man is sitting in a room, he has a gun on his right hand

Link: https://38.media.tumblr.com/712368408b70ce32ee4eb0f9f6fe6639/tumblr_noxlpdet5k1uwrtyko1_500.gif
Description 1: a woman is touching a colored object with her finger.
Description 2: a woman is putting her finger on a badge
Description 3: a woman is bending down to touch a pin.
Description 4: this is a woman bending down to touch a sword.

Link: https://38.media.tumblr.com/7043680fa969949ff4ca0bf5fd9f64a1/tumblr_nnkjulMoHH1qfgmtoo1_500.gif
Description 1: a woman is standing behind a microphone and is running her fingers through her hair.
Description 2: a singing female readjusts her hair while looking at her audience
Description 3: a girl is smiling on stage and pushes her hair back.



In [34]:
# duplicate_links already contains every row whose link is duplicated,
# so a copy of it is all that is needed for the export
duplicated_records = duplicate_links.copy()

# Save to TSV file
duplicated_records.to_csv('duplicated_links_descriptions.tsv', sep='\t', index=False)

print("Duplicated links and descriptions have been exported to 'duplicated_links_descriptions.tsv'.")


Duplicated links and descriptions have been exported to 'duplicated_links_descriptions.tsv'.


In [35]:
# Check for duplicates in both "links" and "description"
duplicates = df[df.duplicated(subset=['links', 'description'], keep=False)]  # 'keep=False' marks all duplicates

print("Duplicate entries in 'links' and 'description':")
print(duplicates[['links', 'description']])


Duplicate entries in 'links' and 'description':
Empty DataFrame
Columns: [links, description]
Index: []


In [37]:
# Identify all records that have duplicates based on the 'links' column
duplicates = df[df.duplicated(subset='links', keep=False)]

# Keep only records whose link occurs exactly once (every copy of a duplicated link is dropped)
non_duplicated_df = df[~df['links'].isin(duplicates['links'])]

# Display the shape of the new DataFrame to confirm
print(f"Original DataFrame shape: {df.shape}")
print(f"Non-Duplicated DataFrame shape: {non_duplicated_df.shape}")

# Check the first few rows of the new DataFrame
print(non_duplicated_df.head())

Original DataFrame shape: (125782, 2)
Non-Duplicated DataFrame shape: (89735, 2)
                                               links  \
0  https://38.media.tumblr.com/9f6c25cc350f12aa74...   
1  https://38.media.tumblr.com/9ead028ef62004ef6a...   
2  https://38.media.tumblr.com/9f43dc410be85b1159...   
3  https://38.media.tumblr.com/9f659499c8754e40cf...   
4  https://38.media.tumblr.com/9ed1c99afa7d714118...   

                                         description  
0  a man is glaring, and someone with sunglasses ...  
1           a cat tries to catch a mouse on a tablet  
2                   a man dressed in red is dancing.  
3     an animal comes close to another in the jungle  
4  a man in a hat adjusts his tie and makes a wei...  
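
The shapes above are consistent with the earlier counts, which can be checked with simple arithmetic. The figures in this sketch are copied from the outputs in this notebook; the variable names are illustrative only.

```python
# Figures copied from the outputs printed in this notebook.
total_rows = 125782               # rows in the original DataFrame
duplicated_rows = 36047           # rows whose link appears more than once
unique_duplicated_links = 12333   # distinct links among those rows
remaining_rows = 89735            # rows kept after filtering

# Every row is either flagged as duplicated or kept.
assert total_rows - duplicated_rows == remaining_rows

# Average number of descriptions per duplicated link.
print(round(duplicated_rows / unique_duplicated_links, 2))
```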


In [38]:
# Reset the index and start it at 1; reassigning avoids modifying a filtered slice of df in place
non_duplicated_df = non_duplicated_df.reset_index(drop=True)
non_duplicated_df.index += 1  # Adjust the index to start from 1

# Display the indexed DataFrame
print(non_duplicated_df)

                                                   links  \
1      https://38.media.tumblr.com/9f6c25cc350f12aa74...   
2      https://38.media.tumblr.com/9ead028ef62004ef6a...   
3      https://38.media.tumblr.com/9f43dc410be85b1159...   
4      https://38.media.tumblr.com/9f659499c8754e40cf...   
5      https://38.media.tumblr.com/9ed1c99afa7d714118...   
...                                                  ...   
89731  https://38.media.tumblr.com/62493b85c2e2a38b04...   
89732  https://31.media.tumblr.com/1213c26056c39e2e6a...   
89733  https://33.media.tumblr.com/0c4ca6f63e2065d6e4...   
89734  https://33.media.tumblr.com/d94f412610a6f58e59...   
89735  https://38.media.tumblr.com/3ad376c0605028a612...   

                                             description  
1      a man is glaring, and someone with sunglasses ...  
2               a cat tries to catch a mouse on a tablet  
3                       a man dressed in red is dancing.  
4         an animal comes close to another 

In [39]:
# Save the non-duplicated DataFrame to a TSV file
non_duplicated_df.to_csv('tgif-v2.0.tsv', sep='\t', index=True)  # index=True to include the index

print("Non-duplicated data has been exported to 'tgif-v2.0.tsv'.")

Non-duplicated data has been exported to 'tgif-v2.0.tsv'.
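
Because the file is written with index=True, the saved index becomes the first column. A minimal sketch of reading such a file back is shown below, using an in-memory buffer and a made-up two-row table in place of 'tgif-v2.0.tsv'; the names sample, buffer and restored are assumptions for illustration.

```python
import io

import pandas as pd

# Small stand-in for non_duplicated_df with a 1-based index (made-up rows).
sample = pd.DataFrame(
    {"links": ["a.gif", "c.gif"], "description": ["a cat", "a bird"]},
    index=[1, 2],
)

# Write to an in-memory buffer the same way the notebook writes the TSV.
buffer = io.StringIO()
sample.to_csv(buffer, sep="\t", index=True)
buffer.seek(0)

# index_col=0 restores the saved index instead of adding a fresh 0-based one.
restored = pd.read_csv(buffer, sep="\t", index_col=0)
print(restored)
```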
