# Filtering Profanity and Duplicates 

In this phase, we filter profanity and duplicates from the text. Profanity can be offensive to readers, and duplicate records waste processing time and can skew the analysis, so they are dropped from the dataset. We use spaCy and the profanityfilter package to censor offensive words with asterisks, then clean up the censored output with regular expressions.


In [1]:
#importing basic package
import pandas as pd

In [2]:
# loading the dataset
data = pd.read_excel('Data for AI Assignment.xlsx')
data.head()

Unnamed: 0,Text,Classification
0,i didnt feel humiliated,sadness
1,i can go from feeling so hopeless to so damned...,sadness
2,im grabbing a minute to post i feel greedy wrong,anger
3,i am ever feeling nostalgic about the fireplac...,love
4,i am feeling grouchy,anger


In [3]:
#examining the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18001 entries, 0 to 18000
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Text            18001 non-null  object
 1   Classification  18001 non-null  object
dtypes: object(2)
memory usage: 281.4+ KB


In [4]:
# Identifying duplicate records
duplicates = data[data.duplicated(subset="Text")]
duplicates

Unnamed: 0,Text,Classification
5067,i feel on the verge of tears from weariness i ...,joy
6133,i still feel a craving for sweet food,love
6563,i tend to stop breathing when i m feeling stre...,anger
7623,i was intensely conscious of how much cash i h...,sadness
7685,im still not sure why reilly feels the need to...,surprise
8246,i am not amazing or great at photography but i...,love
9596,ive also made it with both sugar measurements ...,joy
9687,i had to choose the sleek and smoother feel of...,joy
9769,i often find myself feeling assaulted by a mul...,sadness
9786,i feel im being generous with that statement,joy


In [5]:
#getting the length of the duplicates
duplicate_cnt=len(duplicates)
print("There are",duplicate_cnt,"duplicate records.")

There are 43 duplicate records.
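
Note that `duplicated(subset="Text")` flags only the second and later occurrences of each text, and it ignores the `Classification` column, so two copies of the same sentence could carry different labels. The following is a minimal sketch of how one might inspect that before dropping anything; the `toy` DataFrame and its rows are hypothetical stand-ins for the real dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset (rows are hypothetical).
toy = pd.DataFrame({
    "Text": ["i feel fine", "i feel fine", "i am tired", "i am tired", "i am happy"],
    "Classification": ["joy", "joy", "sadness", "anger", "joy"],
})

# keep=False marks every member of a duplicate group, not just the repeats.
all_copies = toy[toy.duplicated(subset="Text", keep=False)]

# Count distinct labels per duplicated text; more than one means the copies disagree.
labels_per_text = all_copies.groupby("Text")["Classification"].nunique()
conflicting = labels_per_text[labels_per_text > 1].index.tolist()

print(all_copies)
print("Texts with conflicting labels:", conflicting)
```

If any texts show up as conflicting, keeping the first copy silently picks one label over the other, which may or may not be what the analysis wants.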


In [6]:
# Listing the duplicate records that will be removed
removed_records = duplicates.to_dict(orient="records")
removed_records

[{'Text': 'i feel on the verge of tears from weariness i look at your sweet face and cant help but tenderly kiss your cheeks',
  'Classification': 'joy'},
 {'Text': 'i still feel a craving for sweet food', 'Classification': 'love'},
 {'Text': 'i tend to stop breathing when i m feeling stressed',
  'Classification': 'anger'},
 {'Text': 'i was intensely conscious of how much cash i had left in my gas and food envelope and i still have what i intended to save for next week which helps me not feel so stressed and scared',
  'Classification': 'sadness'},
 {'Text': 'im still not sure why reilly feels the need to be so weird',
  'Classification': 'surprise'},
 {'Text': 'i am not amazing or great at photography but i feel passionate about it',
  'Classification': 'love'},
 {'Text': 'ive also made it with both sugar measurements but i feel like cup is just too sweet for me',
  'Classification': 'joy'},
 {'Text': 'i had to choose the sleek and smoother feel of the sweet revenge made drawing and 

In [7]:
# Removing duplicate records based on the "Text" column
data.drop_duplicates(subset="Text", inplace=True)

data.head()

Unnamed: 0,Text,Classification
0,i didnt feel humiliated,sadness
1,i can go from feeling so hopeless to so damned...,sadness
2,im grabbing a minute to post i feel greedy wrong,anger
3,i am ever feeling nostalgic about the fireplac...,love
4,i am feeling grouchy,anger


In [8]:
#examining the new dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17958 entries, 0 to 18000
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Text            17958 non-null  object
 1   Classification  17958 non-null  object
dtypes: object(2)
memory usage: 420.9+ KB
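
The `Int64Index: 17958 entries, 0 to 18000` line shows that the original row labels survive the drop, so the index now has gaps. If contiguous labels are wanted for later positional lookups, one option is to reset the index after deduplicating. This is a sketch on a hypothetical `toy` DataFrame, not a step taken in the notebook.

```python
import pandas as pd

# Toy stand-in for the real dataset (rows are hypothetical).
toy = pd.DataFrame({
    "Text": ["a", "b", "a", "c"],
    "Classification": ["joy", "love", "joy", "anger"],
})

# Drop repeated texts (the first copy is kept), then renumber rows from 0.
deduped = toy.drop_duplicates(subset="Text").reset_index(drop=True)

print(deduped)
```

Assigning the result to a new name instead of using `inplace=True` also keeps the original DataFrame around for comparison.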


In [69]:
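# Importing the profanity removal packages

import spacy
from profanityfilter import ProfanityFilter
import re

# Creating the filter instance used by the functions below
profanity_filter = ProfanityFilter()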


In [94]:
# Function to censor profane words and flag whether any were found

def filter_and_clean_profanities_and_return_detected_profanities(text):
    # Replace each profane word with asterisks
    censored_text = profanity_filter.censor(text)

    # is_profane returns True/False, not the list of matched words
    detected_profanities = profanity_filter.is_profane(text)

    return censored_text, detected_profanities
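
The function above returns a boolean flag rather than the words that were matched. If the matched words themselves are needed, a small word-list approach can return them. The sketch below uses only the standard library and is not how the profanityfilter package works internally; `BLOCKLIST`, `PATTERN`, and `censor_and_list` are hypothetical names, and the two-word list is purely illustrative.

```python
import re

# Hypothetical word list for illustration; a real filter would use a much longer one.
BLOCKLIST = {"damn", "hell"}

# Whole-word, case-insensitive match for any blocklisted term.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in sorted(BLOCKLIST)) + r")\b",
    re.IGNORECASE,
)

def censor_and_list(text):
    """Return (censored_text, list_of_matched_words)."""
    found = [match.group(0).lower() for match in PATTERN.finditer(text)]
    # Mask each match with as many asterisks as it has characters.
    censored = PATTERN.sub(lambda match: "*" * len(match.group(0)), text)
    return censored, found

print(censor_and_list("This is a sample sentence."))
print(censor_and_list("This sentence contains a profanity: damn."))
```

Because the pattern matches whole words only, inflected forms such as "damned" pass through unless they are added to the list.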

In [95]:
# Example phrases to test the code.

example_phrases = [
  "This is a sample sentence.",
  "This sentence contains a profanity: damn."
]

# Filter the example phrases; each result is a (censored_text, is_profane) tuple.
filtered_phrases = [filter_and_clean_profanities_and_return_detected_profanities(phrase) for phrase in example_phrases]

# Printing the results.
for result in filtered_phrases:
    print(result)


('This is a sample sentence.', False)
('This sentence contains a profanity: ****.', True)


In [96]:
# Applying the function to the 'Text' column; each cell becomes a (censored_text, is_profane) tuple
data['Filtered_Text'] = data['Text'].apply(filter_and_clean_profanities_and_return_detected_profanities)
data['Filtered_Text'].head()

0                     (i didnt feel humiliated, False)
1    (i can go from feeling so hopeless to so damne...
2    (im grabbing a minute to post i feel greedy wr...
3    (i am ever feeling nostalgic about the firepla...
4                        (i am feeling grouchy, False)
Name: Filtered_Text, dtype: object
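
Each cell of `Filtered_Text` now holds a `(censored_text, flag)` tuple. An alternative to cleaning the tuples up as strings is to split them into two columns straight away. The sketch below shows the idea; `fake_filter` is a hypothetical stand-in for the notebook's function so that the example runs without the profanityfilter package, and `Has_Profanity` is an assumed column name.

```python
import pandas as pd

def fake_filter(text):
    # Hypothetical stand-in: returns (censored_text, flag) like the notebook's function.
    return text.replace("damn", "****"), "damn" in text

toy = pd.DataFrame({"Text": ["i feel damn tired", "i am feeling grouchy"]})

# Apply once, then expand the tuples into two named columns.
pairs = toy["Text"].apply(fake_filter)
split = pd.DataFrame(
    pairs.tolist(),
    index=toy.index,
    columns=["Filtered_Text", "Has_Profanity"],
)
toy = toy.join(split)

print(toy)
```

Keeping the flag in its own column also makes it easy to count or inspect the rows that were censored.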

In [121]:
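# Making a function to clean the final output

def clean_text(text):
    # The column holds (censored_text, is_profane) tuples, so work on their string form
    text = str(text)

    # Drop the tuple's parentheses
    text = text.replace('(', '').replace(')', '')

    # Drop the trailing True/False flag
    text = re.sub(r', (False|True)$', '', text)

    # Drop censor asterisks of any length (e.g. '****' for a four-letter word)
    text = re.sub(r'\*+', '', text)

    return text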

In [123]:
# Example usage
original_text = "(I feel like I have a ton of catching up to do, False)"
cleaned_text = clean_text(original_text)
cleaned_text

'I feel like I have a ton of catching up to do'
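
The censor mask is as long as the word it replaces (the earlier example turned "damn" into `****`), and deleting a mask from the middle of a sentence leaves a double space behind. A minimal sketch of one way to handle both, using only the standard library; `strip_censor_marks` is a hypothetical helper, not part of the notebook.

```python
import re

def strip_censor_marks(text):
    # Remove any run of asterisks, whatever its length.
    text = re.sub(r"\*+", "", text)
    # Collapse the whitespace left behind and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(strip_censor_marks("i can go from feeling so hopeless to so **** hopeful"))
```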

In [124]:
# Applying the function to the 'Filtered_Text' column
data['Filtered_Text'] = data['Filtered_Text'].apply(clean_text)
data['Filtered_Text'].head()

0                            'i didnt feel humiliated'
1    'i can go from feeling so hopeless to so damne...
2    'im grabbing a minute to post i feel greedy wr...
3    'i am ever feeling nostalgic about the firepla...
4                               'i am feeling grouchy'
Name: Filtered_Text, dtype: object

In [125]:
# Saving the filtered DataFrame to a new CSV file
data.to_csv('filtered_data.csv', index=False)
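
Passing `index=False` matters here because the deduplicated index has gaps: writing it out would add an unnamed column that comes back as `Unnamed: 0` when the file is reloaded. The sketch below illustrates the difference with an in-memory buffer instead of a file; the `toy` DataFrame is hypothetical.

```python
import io
import pandas as pd

# Toy frame with a gappy index, as left behind by drop_duplicates.
toy = pd.DataFrame(
    {"Text": ["a", "b"], "Classification": ["joy", "love"]},
    index=[0, 5],
)

# Without the index: the round trip returns exactly the two data columns.
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

# With the index: the old row labels come back as an extra unnamed column.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

print(without_index.columns.tolist())
print(with_index.columns.tolist())
```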