# Profanity Detection Model

This notebook builds a model that identifies profane words and wraps it in a system
that can check existing text for profanity. The goal is a robust profanity detection system that
accurately flags and filters out offensive or inappropriate language.<br>
## Tasks
1. [Dataset Acquisition](#section1)
2. [Data Preprocessing](#section2)
3. [Model Development & Training](#section3)
4. [Metrics & Evaluation](#section4)
5. [User Input](#section5)
6. [Testing Nameset](#section6)


<a id="section1"></a>
## Dataset Acquisition
The objective is to obtain a labeled dataset of sufficient size that contains a variety of examples of profane words and offensive language. The dataset used here is sourced from:<br>
Thomas Davidson - [Hate Speech & Offensive Language](https://github.com/t-davidson/hate-speech-and-offensive-language)<br>
It contains a list of tweets, some of which contain profane language and some of which do not.<br>
A preview of the data is shown below.

In [9]:
import pandas as pd

excel_file = 'C:/Users/ASUS/Desktop/Profanity Check/My data.xlsx'
df = pd.read_excel(excel_file)

df.tail(10)

Unnamed: 0,Sentence,Class
24773,you niggers cheat on ya gf's? smh....,1
24774,you really care bout dis bitch. my dick all in...,1
24775,"you worried bout other bitches, you need me for?",1
24776,you're all niggers,1
24777,you're such a retard i hope you get type 2 dia...,1
24778,you's a muthaf***in lie &#8220;@LifeAsKing: @2...,1
24779,"you've gone and broke the wrong heart baby, an...",0
24780,young buck wanna eat!!.. dat nigguh like I ain...,1
24781,youu got wild bitches tellin you lies,1
24782,~~Ruffled | Ntac Eileen Dahlia - Beautiful col...,0


In [4]:
print("Shape of Dataset is: ",df.shape,"\n")
print("Types of each column:\n",df.dtypes,sep="")


Shape of Dataset is:  (24783, 4) 

Types of each column:
Sentence           object
Offensive           int64
Non - offensive     int64
Class               int64
dtype: object


##### Results
There are 24,783 sentences in this set, split between profane and non-profane as follows:<br>

| Type        | Count |
|-------------|-------|
| Profane     | 20620 |
| Non-Profane | 4163 |

The data contains many unwanted characters, so it has to be preprocessed before a model is trained.
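The counts in the table above imply a strong imbalance toward the profane class, which is worth keeping in mind when reading the accuracy figures later on. As a rough illustration in plain Python, using only the two counts from the table, the share of each class and the accuracy of a hypothetical classifier that always predicts "profane" (roughly 83%) can be worked out as follows:

```python
# Class counts taken from the table above
counts = {"Profane": 20620, "Non-Profane": 4163}
total = sum(counts.values())  # 24783 sentences

# Share of each class in the dataset
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label}: {share:.1%}")

# Hypothetical baseline: always predict the majority class
baseline_accuracy = max(counts.values()) / total
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any model trained on this data should be judged against that baseline rather than against 50%.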

<a id="section2"></a>
## Data Preprocessing
Because the dataset consists of tweets, we clean it by removing extra characters, punctuation, and similar noise.
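The notebook does its cleaning with NLTK in the cells that follow. As a complementary illustration, here is a minimal regex-based sketch of the kind of noise tweets carry (HTML entities, URLs, mentions, retweet markers). The function name `clean_tweet` and the exact patterns are assumptions made for this example, not part of the original pipeline.

```python
import html
import re

def clean_tweet(text):
    """Strip common tweet noise; a rough sketch, not the notebook's method."""
    text = html.unescape(text)                 # decode entities such as "&amp;"
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)          # drop @mentions
    text = re.sub(r"\bRT\b", " ", text)        # drop the retweet marker
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # keep letters and whitespace only
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_tweet("!!! RT @some_user: Hello &amp; welcome http://t.co/abc123 :)"))
# hello welcome
```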

#### Soundex function

In [None]:
def soundex(token):
    """Return a 4-character Soundex-style code for a single word."""

    # Guard against empty input, which would otherwise fail on token[0]
    if not token:
        return ""

    # Convert the word to upper case
    token = token.upper()
    soundexx = ""

    # Retain the first letter
    soundexx += token[0]

    # Dictionary mapping groups of letters to their Soundex codes.
    # Vowels and 'H', 'W' and 'Y' are represented by '.' and skipped.
    dictionary = {"BFPV": "1", "CGJKQSXZ": "2",
                  "DT": "3",
                  "L": "4", "MN": "5", "R": "6",
                  "AEIOUHWY": "."}

    # Encode the remaining letters, skipping a code that repeats
    # the last character already emitted
    for char in token[1:]:
        for key in dictionary.keys():
            if char in key:
                code = dictionary[key]
                if code != '.':
                    if code != soundexx[-1]:
                        soundexx += code

    # Trim or pad to a 4-character code
    # (change 4 to 7 here if a 7-character code is required)
    soundexx = soundexx[:4].ljust(4, "0")
    return soundexx
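To see what this encoding does in practice, the sketch below restates the same logic compactly so that it runs on its own, then encodes a few neutral sample words. The `length` parameter and the `CODES` name are assumptions added for the example. Similar-sounding words such as "Robert" and "Rupert" collapse to one code, which is why Soundex might help catch spelling variants.

```python
# Compact restatement of the notebook's soundex() so this block runs on its own
CODES = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}

def soundex(token, length=4):
    token = token.upper()
    result = token[0]
    for char in token[1:]:
        for letters, code in CODES.items():
            # Skip a code that repeats the last character already emitted
            if char in letters and code != result[-1]:
                result += code
    return result[:length].ljust(length, "0")

for word in ["Robert", "Rupert", "Tymczak"]:
    print(word, soundex(word))
# Robert R163
# Rupert R163
# Tymczak T520
```

This variant compares each code with the last emitted character even when vowels intervene, so it can differ from standard American Soundex, which encodes "Tymczak" as T522.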



In [10]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer=WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
sen = df["Sentence"]


# Defining a lemmatizer
# Removes stop words, tokens containing non-letters and tokens of length < 3,
# then lemmatizes each remaining token as a verb and converts it to lower case.
# Note: the stop-word check runs before lowercasing, so capitalized stop
# words such as "The" are kept.
def lemmatize_words(text):
    words = word_tokenize(text)
    words = [
        lemmatizer.lemmatize(word, pos='v').lower()
        for word in words
        if word not in stop_words and len(word) >= 3 and word.isalpha()
    ]
    return ' '.join(words)


# Soundex prototype 1
# Alternative version that encodes each lemma with soundex();
# uncomment it to use in place of the function above.
# def lemmatize_words(text):
#        words = word_tokenize(text)
#        words = [soundex(lemmatizer.lemmatize(word,pos='v')) for word in words if not word in stop_words if len(word)>=3 if word.isalpha()]
#        return ' '.join(words)

df["clean_Sentence"] = sen.apply(lemmatize_words)

print(df.head(10))


                                            Sentence  Class  \
0  !!! RT @mayasolovely: As a woman you shouldn't...      0   
1  !!!!! RT @mleew17: boy dats cold...tyga dwn ba...      1   
2  !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...      1   
3  !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...      1   
4  !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...      1   
5  !!!!!!!!!!!!!!!!!!"@T_Madison_x: The shit just...      1   
6  !!!!!!"@__BrighterDays: I can not just sit up ...      1   
7  !!!!&#8220;@selfiequeenbri: cause I'm tired of...      1   
8  " &amp; you might not get ya bitch back &amp; ...      1   
9  " @rhythmixx_ :hobbies include: fighting Maria...      1   

                                      clean_Sentence  
0  mayasolovely woman complain clean house amp ma...  
1    boy dats cold tyga dwn bad cuffin dat hoe place  
2  urkindofbrand dawg you ever fuck bitch start c...  
3                                   look like tranny  
4  shenikaroberts the shit hear

<a id="section3"></a>
## Model Development

The data has been cleaned, so we now import the required libraries and train the models.<br>
Classification is done with two models: a decision tree (scikit-learn's `DecisionTreeClassifier`; no support vector machine is actually fitted in this notebook) and logistic regression. The more accurate of the two is used afterwards.
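The cells below vectorize the whole dataset first and split afterwards. As a hedged alternative sketch, scikit-learn's `make_pipeline` can bundle the vectorizer and the classifier so that the vocabulary is learned from the training split only. The toy texts, the placeholder token "badword" and the split parameters here are assumptions for illustration and do not come from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in data; "badword" is a placeholder token, not real dataset content
texts = [
    "you are a badword", "what a badword move", "total badword behaviour",
    "such a badword thing to say", "have a nice day", "see you tomorrow",
    "the weather is lovely", "thanks for the help",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The pipeline fits the vectorizer on the training texts only
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(list(zip(X_test, predictions)))
```

A pipeline also accepts raw strings at prediction time, so new sentences do not need a separate `transform` call.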

In [6]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

### Decision Tree Model

In [24]:
x = np.array(sen)
y = np.array(df["Class"])

#Vectorizing the text
cv = CountVectorizer()
x = cv.fit_transform(x)
# print(x)


# Splitting the dataset (shuffle=False keeps the original row order,
# so random_state has no effect here)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=0, shuffle=False)
clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)



### Logistic Regression

In [12]:
x = np.array(sen)
y = np.array(df["Class"])

#Vectorizing the text
vect = TfidfVectorizer()
xx = vect.fit_transform(x)

#Splitting the Dataset
xx_train, xx_test, yy_train, yy_test = train_test_split(xx,y, test_size = 0.33, random_state= 0, shuffle= False)
logreg = LogisticRegression()
logreg.fit(xx_train,yy_train)



<a id="section4"></a>
## Metrics & Evaluation

In [25]:
## DECISION TREE MODEL (clf is a DecisionTreeClassifier)

from sklearn.metrics import accuracy_score, f1_score, precision_score
from sklearn.metrics import recall_score, confusion_matrix


y_pred = clf.predict(x_test)
print("Metrics for Decision Tree Model: \n")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))



Metrics for Decision Tree Model: 

Accuracy: 0.9543954028609855
Precision: 0.9772054470100652
Recall: 0.9678932707814103
F1 Score: 0.9725270678353097
Confusion Matrix:
 [[1204  154]
 [ 219 6602]]


In [14]:
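## LOGISTIC REGRESSION

from sklearn.metrics import accuracy_score, f1_score, precision_score
from sklearn.metrics import recall_score, confusion_matrix

# logreg was trained on TF-IDF features, so evaluate it on the matching
# TF-IDF test split (xx_test / yy_test) rather than the count-based x_test.
# The figures recorded below were produced with x_test, so rerunning this
# cell may give different numbers.
yy_pred = logreg.predict(xx_test)
print("Metrics for Logistic Regression: \n")
print("Accuracy:", accuracy_score(yy_test, yy_pred))
print("Precision:", precision_score(yy_test, yy_pred))
print("Recall:", recall_score(yy_test, yy_pred))
print("F1 Score:", f1_score(yy_test, yy_pred))
print("Confusion Matrix:\n", confusion_matrix(yy_test, yy_pred))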

Metrics for Logistic Regression: 

Accuracy: 0.9402127399437584
Precision: 0.9595065312046445
Recall: 0.9692127254068318
F1 Score: 0.9643352053096055
Confusion Matrix:
 [[1079  279]
 [ 210 6611]]
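The printed metrics follow directly from the two confusion matrices. The sketch below re-derives them in plain Python, assuming scikit-learn's layout for labels 0 and 1, `[[TN, FP], [FN, TP]]`. The helper name `metrics_from_confusion` is made up for this example, and "first model" refers to the classifier stored in `clf`.

```python
def metrics_from_confusion(matrix):
    """Compute metrics from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Confusion matrices copied from the outputs above
first_model = [[1204, 154], [219, 6602]]
logistic_regression = [[1079, 279], [210, 6611]]

for name, matrix in [("First model (clf)", first_model),
                     ("Logistic regression", logistic_regression)]:
    accuracy, precision, recall, f1 = metrics_from_confusion(matrix)
    print(f"{name}: accuracy={accuracy:.4f} precision={precision:.4f} "
          f"recall={recall:.4f} f1={f1:.4f}")
```

Both matrices have the same row sums (1,358 non-profane and 6,821 profane test sentences), which confirms that the two models were scored against the same test labels.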


Since both accuracy and F1 score are higher for the first model (`clf`, a `DecisionTreeClassifier`) than for logistic regression, `clf` is used for the remaining tests.<br>
We can also inspect the misclassified examples to look for patterns in where the model makes mistakes.
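One way to collect the misclassified examples is to iterate over the original texts together with the true and predicted labels. This is a minimal sketch with hypothetical toy inputs; the function name `split_errors` is an assumption and not part of the notebook.

```python
def split_errors(texts, y_true, y_pred):
    """Return (false_positives, false_negatives) as lists of texts."""
    false_pos, false_neg = [], []
    for text, actual, predicted in zip(texts, y_true, y_pred):
        if actual == 0 and predicted == 1:
            false_pos.append(text)    # flagged, but labeled non-profane
        elif actual == 1 and predicted == 0:
            false_neg.append(text)    # missed, but labeled profane
    return false_pos, false_neg

# Hypothetical toy inputs, only to show the shape of the result
texts = ["sample a", "sample b", "sample c", "sample d"]
y_true = [1, 0, 1, 0]
y_pred = [1, 1, 0, 0]

false_pos, false_neg = split_errors(texts, y_true, y_pred)
print("False positives:", false_pos)
print("False negatives:", false_neg)
```

Keeping the original strings alongside the labels avoids having to reconstruct each sentence from its vectorized row, a step that loses the original word order.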

#### FP and FN

In [15]:
false_pos = []
false_neg = []


# Collect the vectorized rows that the model got wrong
for i in range(len(y_test)):
    if y_test[i] == 1 and y_pred[i] == 0:
        false_neg.append(x_test[i])
        # print(x_test[i])
    if y_test[i] == 0 and y_pred[i] == 1:
        false_pos.append(x_test[i])
        # print(x_test[i])


# Map each false-negative row back to the words it contains
for j in false_neg:
    k = cv.inverse_transform(j)
    print(" ".join(k[0]))
    print("\n")

# for j in false_pos:
#     k = cv.inverse_transform(j)
#     print(" ".join(k[0]))
#     print("\n")

# For logistic regression, use xx_test, yy_test and yy_pred in place of
# x_test, y_test and y_pred, and replace cv with vect


8230 always and began boated clam flow go good had hallmarknostracollecti juicy little man mary nostradamnisuck of rt she to touch was


any in is it now_thats_fresh rt song trash tyga with


127480 127482 badass bald bird change eagle if logo merica ourgreatamerica rt should the think to twitter you


all cowboys poonsoaker rt so trash


999 bird boy done flappy im in life poisonedkissx3 reached rt so this white with


co girls have how http jprbhuxfdp love pornpunter pussies rt sweet


8230 at em has https in just of office on or out politicians rt run rwsurfergirl the throw toss trash trashbucketchallenge


all dam dental gotta have is luuuube luuuuube need racheldoesstuff really rt ya yankees you


co http of pic posing rakwonogod rare rt trash usgiazsdwv with


8230 99 be bring food from ghetto girl if into might movies no outside pay ratchet2english rt stupid the tweeted you


aight co dykes game had http it o0can6gb1p over rihannahasaids rt ruin to


8230 99 be bring food from g

<a id="section5"></a>
## Input test


In [88]:

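def test(x):
    # Relies on the globals `hi` (Hinglish profanity list) and `ex`
    # (exception words), defined in the "Loading Hindi dataset" cells.
    x = x.lower()
    # Replacing symbols
    # replacements = {'$':'s', '@':'a', '4':'a', '8':'b', '3':'e', '1':'i', '0':'o', '5':'s', '7':'t' }
    # x = ''.join([replacements.get(char,char) for char in x])

    # Vectorize with the fitted CountVectorizer and classify
    y = cv.transform([x]).toarray()
    ans = clf.predict(y)

    # Names chosen so that the built-in `bool` is not shadowed
    in_profanity_list, not_excepted = False, True
    for k in x.split():
        if k in ex:
            not_excepted = False
        if k in hi:
            in_profanity_list = True
    # Profane if the model flags the text and no exception word is present,
    # or if any word appears in the Hinglish profanity list
    if (ans[0] and not_excepted) or in_profanity_list:
        return 1
    return 0

a = input("Enter the sentence: ")
ans = test(a)

print("Profane" if ans == 1 else "Not Profane")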


Enter the sentence: bhosdike
Profane
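The commented-out `replacements` mapping in `test()` hints at normalizing look-alike symbols before classification. A minimal sketch of that idea with `str.translate` is shown below. The mapping is copied from the comment, while the names `REPLACEMENTS`, `LEET_TABLE` and `normalize_symbols` are assumptions for this example.

```python
# Mapping taken from the commented-out `replacements` dict in test()
REPLACEMENTS = {'$': 's', '@': 'a', '4': 'a', '8': 'b', '3': 'e',
                '1': 'i', '0': 'o', '5': 's', '7': 't'}
LEET_TABLE = str.maketrans(REPLACEMENTS)

def normalize_symbols(text):
    """Lowercase the text and map look-alike symbols to letters."""
    return text.lower().translate(LEET_TABLE)

print(normalize_symbols("H3ll0 W0rld"))   # hello world
print(normalize_symbols("gr8 $tuff"))     # grb stuff
```

The second call shows the trade-off: digits that are meant as digits get rewritten too, which may be one reason the step is left disabled in the notebook.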


<a id="section6"></a>
## Testing Nameset

In [18]:
excel_ = 'C:/Users/ASUS/Desktop/Profanity Check/Name Set.xlsx'
tf = pd.read_excel(excel_)
names = tf['Names']

tf.head(10)

Unnamed: 0,Names
0,Raj Junior
1,Srijan Sinha
2,Jagdish Patel
3,Raju Madur
4,Ahaan Khan
5,Nabila Shaikh
6,Kabir Sud
7,Aditya Bhattacharyya
8,Mirtunjay Kumar
9,Gulshan Kumar


In [21]:
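X = np.array(names)
X = X.astype(str)

# Vectorizing the text with the already-fitted vectorizer.
# transform (not fit_transform) keeps the training vocabulary, so the
# features line up with what clf was trained on.
X = cv.transform(X)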


In [None]:
YP = clf.predict(X)

##### Loading Hindi dataset

In [86]:
h = 'C:/Users/ASUS/Desktop/Profanity Check/Hinglish_Profanity_List.csv'
hin = pd.read_csv(h)
hin = hin['badir']

In [87]:
hi = hin.values.tolist()
ex = ['deep','ms','folks', 'smile','aware', 'da', 'mutha']
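The two lists feed a simple override rule: a word on the profanity list always marks the text as profane, while a word on the exception list cancels only the model's verdict. The sketch below isolates that rule under stated assumptions: the function name `final_label`, the placeholder list entries and the use of sets (for faster membership checks than lists) are all illustrative choices rather than the notebook's code.

```python
def final_label(text, model_flag, blocklist, exceptions):
    """Combine a classifier's verdict with a blocklist and an exception list.

    model_flag: 1 if the classifier marked the text as profane, else 0.
    """
    tokens = text.lower().split()
    has_exception = any(token in exceptions for token in tokens)
    has_blocked = any(token in blocklist for token in tokens)
    # Blocklist hits always win; exceptions only cancel the model's verdict
    return int((model_flag and not has_exception) or has_blocked)

# Placeholder entries; the real lists are `hi` and `ex` in the notebook
blocklist = {"blockedword"}
exceptions = {"deep", "smile"}

print(final_label("deep thoughts", 1, blocklist, exceptions))      # 0
print(final_label("some blockedword", 0, blocklist, exceptions))   # 1
print(final_label("plain text", 1, blocklist, exceptions))         # 1
```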

In [77]:
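import time

start = time.time()

for i in names:
    out = test(i)
    # Names chosen so that the built-in `bool` is not shadowed
    in_profanity_list, not_excepted = False, True
    # if (out==1):
    #     print(i)
    for k in i.split():
        if k.lower() in ex:
            not_excepted = False
        if k in hi:
            in_profanity_list = True
    if (out and not_excepted) or in_profanity_list:
        print(i)

end = time.time()

# Elapsed time in seconds
print(end - start)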

Manoranjan Shit
subhash  jaat
Ho Ok Ooyyluc Hchchcchku
Bhanwarlal jaat
Bhanwarlal jaat
Bhai Ho
Babulal Godara jaat
Pavan jaat
keethi kutti
Sayan Shit
niku chutia
Akash Darkie
Dilip  Omg
Sanu Shit
OMG Tiub OMG Tiub
Tej Ho  Joshi
WHITE  DEVIL 
Santanu Shit
Omg Gupta
Black White
Bhabani Sankar  Shit
Sanket jaat
Deepanshu  joon
White Devil
Santu Kumar Shit
Shraban  Shit
Rahul Tried
Omg Godghase
abid anus abid anus
abid anus abid anus
Homework Revice
Hell XD
Hook Hookio
Krishanu Shit
Suman Shit
Debendra Nath  Shit
basavaraju  bc
santanu shit
Vinit jaat
Tapas  Shit
Omg Rai
Tapas Sheet
Gopal Shit
Mohammed Ali  General
sahil shet
Ujjal Shit
Biswajit  Shit
John Ho
SUDESH  SHIT
writtik shit
Sukdeb Sheet
KUNTAL SHEET
subhash jaat
Manik Sheet
SNEHASISH SHIT
WHITE WOLF_YT ALL GAMES
Manik Sheet
White Devil
Manik Shit
Apu  Shit
MANOJIT SHIT
Haramohan Shit
Samir Shit
Samir Shit
Ass Reddy
Suva Shit
Manish  joon
Omg Tamizhaa
Sumati  Shit
Suman Shit
Kamal Deep.U
169.38316893577576
