### Data Cleaning


References: [Webis-TLDR-17_filtering.ipynb](https://github.com/anna-kay/Reddit-summarization/blob/main/notebooks/filtering/Webis-TLDR-17_filtering.ipynb)

About

Last updated: 2024-05-11

Created by: Oksana Kalytenko

In [1]:
import csv
import re

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
# Load the semicolon-separated CSV file into a DataFrame
webis_tldr_subreddit = pd.read_csv("data/webis_tldr_subreddit_explainlikeimfive.csv", sep=';')

# Display the first rows of the DataFrame
webis_tldr_subreddit.head()

| | author | body | normalizedBody | subreddit | subreddit_id | id | content | summary |
|---|---|---|---|---|---|---|---|---|
| 0 | Cypriotmenace | Think of it like mailing pages of a book to di... | Think of it like mailing pages of a book to di... | explainlikeimfive | t5_2sokd | c6dydfx | Think of it like mailing pages of a book to di... | Always look for the highest seeded torrents, a... |
| 1 | senatorskeletor | "Redistribution" is short for "redistribution ... | "Redistribution" is short for "redistribution ... | explainlikeimfive | t5_2sokd | c6whsmv | Redistribution" is short for "redistribution o... | 1) using the tax system to take money from the... |
| 2 | callumgg | The Chinese system isn't exactly transparent, ... | The Chinese system isn't exactly transparent, ... | explainlikeimfive | t5_2sokd | c6y9grw | The Chinese system isn't exactly transparent, ... | 2700 Delegates and representatives from all ov... |
| 3 | mcanerin | Here is an analogy I've used before and might ... | Here is an analogy I've used before and might ... | explainlikeimfive | t5_2sokd | c6yj68l | Here is an analogy I've used before and might ... | the Communist Party isn't a political party, i... |
| 4 | neo45 | This is a complicated question, but I think it... | This is a complicated question, but I think it... | explainlikeimfive | t5_2sokd | c7fuozw | This is a complicated question, but I think it... | There's lots of good actors out there, but ver... |


In [3]:
# Lowercase the texts and summaries so that the duplicate check below ignores letter case
webis_tldr_subreddit.loc[:, 'content'] = webis_tldr_subreddit['content'].str.lower()
webis_tldr_subreddit.loc[:, 'summary'] = webis_tldr_subreddit['summary'].str.lower()

In [4]:
# Count how often each (lowercased) text occurs and keep the texts that occur more than once
content_counts = webis_tldr_subreddit['content'].value_counts()
exact_duplicates = content_counts[content_counts > 1]

exact_duplicates_df = pd.DataFrame({'value': exact_duplicates.index, 'occurencies_count': exact_duplicates.values})

# For each duplicated text, collect the indices of all rows that contain it
exact_duplicates_texts_indices_lists = []

for element in exact_duplicates_df['value'].to_list():
    element_occurence_indices = webis_tldr_subreddit.index[webis_tldr_subreddit['content'] == element].tolist()
    exact_duplicates_texts_indices_lists.append(element_occurence_indices)

# Keep the first occurrence of each text and mark every later occurrence for removal
exact_duplicates_texts_indices = []

for element in exact_duplicates_texts_indices_lists:
    exact_duplicates_texts_indices.extend(element[1:])
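
The same set of indices can probably be obtained more compactly with `Series.duplicated`. The following is only a sketch on hypothetical toy data (the `toy` DataFrame and its values are assumptions, not part of the dataset). Missing values are masked out explicitly to mirror `value_counts()`, which ignores them by default. Note that the indices come out in row order rather than grouped by text, so the result matches the loop above as a set, not necessarily as a list.

```python
import pandas as pd

# Hypothetical toy data standing in for webis_tldr_subreddit
toy = pd.DataFrame({"content": ["a", "b", "a", None, "b", "a", None]})

# True for every row whose content already appeared in an earlier row
repeated = toy["content"].duplicated(keep="first")

# Exclude missing values, since value_counts() does not count them either
later_duplicates_mask = repeated & toy["content"].notna()

# Indices of all later occurrences (the first occurrence of each text is kept)
duplicate_indices = toy.index[later_duplicates_mask].tolist()
print(duplicate_indices)
```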

In [5]:
# Save the indices of the duplicate rows, one index per line, for later reference
with open('webis_tldr_exact_duplicates_texts_indices.txt', 'w') as f:
    for item in exact_duplicates_texts_indices:
        f.write(f"{item}\n")

In [6]:
exact_duplicates_df

| | value | occurencies_count |
|---|---|---|
| 0 | as the record turns under it, the stylus rides... | 8 |
| 1 | no. it's actually a weakness in our democratic... | 6 |
| 2 | said best by /u/itguy_theyrelying " it's not b... | 5 |
| 3 | what does | 4 |
| 4 | this is an awesome movie, and was an sf ground... | 4 |
| 5 | a prolific sf writer named l. ron hubbard repo... | 4 |
| 6 | hey something i know : \n so, first of like ma... | 3 |
| 7 | so about a century ago, we thought everything ... | 3 |
| 8 | let me clear this up... joseph kony is in fact... | 3 |
| 9 | the biggest expense in space travel is getting... | 2 |


In [7]:
not_useful_texts_indices = []

# Find the indices of the 'documents' that are empty or contain no text (e.g., punctuation marks only)

# Regular expression pattern that matches at least one ASCII letter or digit
text_pattern = re.compile(r"[a-zA-Z0-9]+")

# Iterate over (index label, content) pairs of the 'content' column
for index, content in webis_tldr_subreddit['content'].items():
    if pd.isnull(content) or (len(content) == 0) or (not text_pattern.search(content)):
        not_useful_texts_indices.append(index)
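
For larger subsets, the same filter could likely be expressed without an explicit Python loop. The sketch below uses `Series.str.contains` on hypothetical toy data (the `toy` DataFrame is an assumption for illustration only); `na=False` makes missing values count as rows without text, which corresponds to the `pd.isnull` check above.

```python
import pandas as pd

# Hypothetical toy data standing in for webis_tldr_subreddit
toy = pd.DataFrame({"content": ["some text", "...", "", None, "42", "?!"]})

# True where the content holds at least one ASCII letter or digit;
# na=False treats missing values as "no text"
has_text = toy["content"].str.contains(r"[a-zA-Z0-9]", regex=True, na=False)

# Indices of rows that are empty, missing, or punctuation only
not_useful_indices = toy.index[~has_text].tolist()
print(not_useful_indices)
```

As in the loop above, the pattern only recognizes ASCII letters and digits, so a text written entirely in another script would also be flagged.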


In [8]:
# Combine the indices of the non-text rows with those of the later duplicate occurrences
indices_to_remove = not_useful_texts_indices + exact_duplicates_texts_indices

In [9]:
webis_tldr_subreddit.drop(indices_to_remove, inplace=True, axis=0)
webis_tldr_subreddit

| | author | body | normalizedBody | subreddit | subreddit_id | id | content | summary |
|---|---|---|---|---|---|---|---|---|
| 0 | Cypriotmenace | Think of it like mailing pages of a book to di... | Think of it like mailing pages of a book to di... | explainlikeimfive | t5_2sokd | c6dydfx | think of it like mailing pages of a book to di... | always look for the highest seeded torrents, a... |
| 1 | senatorskeletor | "Redistribution" is short for "redistribution ... | "Redistribution" is short for "redistribution ... | explainlikeimfive | t5_2sokd | c6whsmv | redistribution" is short for "redistribution o... | 1) using the tax system to take money from the... |
| 2 | callumgg | The Chinese system isn't exactly transparent, ... | The Chinese system isn't exactly transparent, ... | explainlikeimfive | t5_2sokd | c6y9grw | the chinese system isn't exactly transparent, ... | 2700 delegates and representatives from all ov... |
| 3 | mcanerin | Here is an analogy I've used before and might ... | Here is an analogy I've used before and might ... | explainlikeimfive | t5_2sokd | c6yj68l | here is an analogy i've used before and might ... | the communist party isn't a political party, i... |
| 4 | neo45 | This is a complicated question, but I think it... | This is a complicated question, but I think it... | explainlikeimfive | t5_2sokd | c7fuozw | this is a complicated question, but i think it... | there's lots of good actors out there, but ver... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25477 | tedcase | I took no notice of it at first because of jus... | I took no notice of it at first because of jus... | explainlikeimfive | t5_2sokd | t3_38ys1t | i took no notice of it at first because of jus... | version of the modern flat earth theory, and w... |
| 25478 | JedWasTaken | I realize that our eyesight can be permanently... | I realize that our eyesight can be permanently... | explainlikeimfive | t5_2sokd | t3_2zk4yp | i realize that our eyesight can be permanently... | or elaborate explanation for this topic! |
| 25479 | [deleted] | After reading through pages and pages of infor... | After reading through pages and pages of infor... | explainlikeimfive | t5_2sokd | t3_kizn8 | after reading through pages and pages of infor... | ers out there. |
| 25480 | OneFatTurkey | \*now wait wait wait\* \n\nI obviously don't mea... | now wait wait wait \n I obviously don't mean t... | explainlikeimfive | t5_2sokd | t3_31l0jz | now wait wait wait \n i obviously don't mean t... | of this thread: "evil cop" incidences have pro... |


In [10]:
# Save the cleaned data to the "data" folder; quote all non-numeric fields so that
# semicolons, quotes, and line breaks inside the texts do not break the CSV structure
webis_tldr_subreddit.to_csv("data/cleaned_webis_tldr_subreddit_explainlikeimfive.csv", index=False, sep=";", encoding='utf-8', quoting=csv.QUOTE_NONNUMERIC)
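
To illustrate why the quoting matters, here is a small round-trip sketch on hypothetical toy data (the `toy` DataFrame, its column names, and the in-memory buffer are assumptions for illustration; the notebook itself writes to a file). It writes texts containing a semicolon, double quotes, and a line break with the same `sep` and `quoting` settings and reads them back.

```python
import csv
import io

import pandas as pd

# Hypothetical toy data with characters that could break a ';'-separated file
toy = pd.DataFrame({
    "content": ["first; second", 'he said "hi"\nand left'],
    "score": [1, 2],
})

# Write to an in-memory buffer with the same separator and quoting as above
buffer = io.StringIO()
toy.to_csv(buffer, index=False, sep=";", quoting=csv.QUOTE_NONNUMERIC)

# Read the data back and compare it with the original values
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";")
round_trip_ok = (
    restored["content"].tolist() == toy["content"].tolist()
    and restored["score"].tolist() == toy["score"].tolist()
)
print(round_trip_ok)
```

A similar check against the saved file could be run after the export to confirm that the number of rows read back matches `len(webis_tldr_subreddit)`.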