# Sentiment Analysis of Comments: Identifying Insulting Content

## 1. Business Understanding

### 1.1 Project description

With the rise of online platforms and social media, it is crucial that user-generated content maintains a basic level of decorum. In this project, we analyze a dataset of user comments and determine whether each comment is insulting or not. Through data analysis and modeling, we aim to classify comments into these two categories and thereby automate part of the content moderation process.

### 1.2 Project steps

Data Collection: Gather a dataset containing user comments.

Data Preprocessing: Prepare the dataset for analysis by performing text preprocessing, which includes tasks such as tokenization, lowercasing, removal of punctuation, and stop word removal.

Exploratory Data Analysis: Conduct exploratory data analysis to gain insights into the distribution of comments, temporal patterns, and other relevant trends in the dataset.

Feature Engineering: Create appropriate features from the text data, such as word embeddings, n-grams, and other relevant linguistic features.

Machine Learning Models: Develop and train machine learning models to predict and classify comments into insulting and non-insulting categories. Evaluate the model's performance through metrics such as accuracy, precision, recall, and F1-score.
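
The evaluation metrics named in the last step can be computed directly from the counts of true/false positives and negatives. The sketch below is only an illustration: the label lists and the helper `classification_metrics` are hypothetical and do not come from the project's dataset or models.

```python
# Hypothetical labels for illustration only: 1 = insulting, 0 = not insulting.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


metrics = classification_metrics(y_true, y_pred)
print(metrics)  # all four values are 0.75 for these example labels
```

For insult detection the classes are often imbalanced, so precision and recall on the insulting class tend to be more informative than accuracy alone.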

## 2. Data Understanding


Create the DataFrames from the required input files.

### 2.1. Importing data

In [80]:
import pandas as pd 
import re
from string import punctuation
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import matplotlib.pyplot as plt
from wordcloud import WordCloud

Load the dataset from a CSV file


In [81]:
df = pd.read_csv('test.csv')


Rename columns for clarity


In [82]:
df.rename(columns={'Comment': 'TextComment'}, inplace=True)

In [83]:
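# Length of each comment in characters
df['CommentLength'] = df['TextComment'].apply(len)
# Total number of rows, used later to compute the share of missing values
df_len = len(df.index)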

### 2.2. Data exploration

### 2.2.1 Columns description

- 'id': A unique identifier for each comment.
- 'TextComment': The raw text content of the user's comment.
- 'CommentLength': The length (number of characters) of each comment.

## 3. Data preparation and pre-processing

### 3.1. Data cleaning

Clean the text by removing punctuation


In [84]:
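# Apply the substitution to every comment in the column, not to the literal string 'TextComment'
clean_text = df['TextComment'].astype(str).apply(
    lambda text: re.sub(f"[{re.escape(punctuation)}]", "", text)
)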

Function to remove certain characters from a sentence before tokenization


In [85]:
def remove_characters_before_tokenization(sentence, keep_apostrophes=False):
    sentence = sentence.strip()
    if keep_apostrophes:
        # Remove only a fixed set of special characters; apostrophes are kept.
        # '|' is matched literally inside a character class, so it is listed once.
        PATTERN = r'[?|$&*%@()~]'
    else:
        # Remove everything except letters, digits, and spaces.
        PATTERN = r'[^a-zA-Z0-9 ]'
    return re.sub(PATTERN, r'', sentence)

In [86]:
clean_text1 = [remove_characters_before_tokenization(comment) for comment in clean_text]
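
To make the effect of the `keep_apostrophes` flag concrete, the following self-contained sketch repeats the function and runs it on a single sentence. The sample sentence is hypothetical and chosen only to show the difference between the two modes.

```python
import re


def remove_characters_before_tokenization(sentence, keep_apostrophes=False):
    sentence = sentence.strip()
    if keep_apostrophes:
        # Remove only a fixed set of special characters.
        pattern = r'[?|$&*%@()~]'
    else:
        # Remove everything except letters, digits, and spaces.
        pattern = r'[^a-zA-Z0-9 ]'
    return re.sub(pattern, '', sentence)


# Hypothetical sample sentence, not taken from the dataset.
sample = "  You're wrong @user, 100%!  "

strict = remove_characters_before_tokenization(sample)
lenient = remove_characters_before_tokenization(sample, keep_apostrophes=True)

print(strict)   # Youre wrong user 100
print(lenient)  # You're wrong user, 100!
```

The default mode is more aggressive: it also drops apostrophes and commas, which merges contractions such as "You're" into "Youre" before tokenization.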


In [87]:
def nlp_preprocessing(text):
    # Note: the NLTK tokenizer, stop word, and WordNet resources must be
    # available locally (they can be fetched with nltk.download).

    # Tokenize
    tokens = word_tokenize(text)

    # Convert tokens to lower case
    tokens = [token.lower() for token in tokens]

    # Remove stop words (common words that do not carry significant meaning)
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize the tokens
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Remove non-alphabetic tokens
    tokens = [token for token in tokens if token.isalpha()]

    # Join the processed tokens back into a single string
    all_comments = ' '.join(tokens)

    return all_comments

Detect missing data: compute the fraction of missing values per column, then drop incomplete rows.

In [88]:
miss_val_df = df.isnull().sum(axis=0) / df_len
print(miss_val_df)

df.dropna(inplace=True)


Apply the NLP preprocessing function to each comment and create a new column 'ProcessedComment'.

In [89]:
df['ProcessedComment'] = df['TextComment'].apply(nlp_preprocessing)

# Keep only the relevant columns
df = df[['id', 'TextComment', 'ProcessedComment', 'CommentLength']]
df.head(50)


|  | id | TextComment | ProcessedComment | CommentLength |
|---|---|---|---|---|
| 0 | 1 | "like this if you are a tribe fan" | like tribe fan | 34 |
| 1 | 2 | "you're idiot......................." | idiot | 37 |
| 2 | 3 | "I am a woman Babs, and the only "war on women... | woman babs war woman see coming jackazzes like... | 201 |
| 3 | 4 | "WOW & YOU BENEFITTED SO MANY WINS THIS YEAR F... | wow benefitted many win year bat nice stupid | 70 |
| 4 | 5 | "haha green me red you now loser whos winning ... | haha green red loser who winning moron | 56 |
| 5 | 6 | "\nMe and God both hate-faggots.\n\nWhat's the... | god difference refrigerator fart put meat | 144 |
| 6 | 7 | "Oh go kiss the ass of a goat....and you DUMMY... | oh go kiss as goat dummycraps insult veteran e... | 185 |
| 7 | 8 | "Not a chance Kid, you're wrong." | chance kid wrong | 33 |
| 8 | 9 | "On Some real Shit FUck LIVE JASMIN!!!" | real shit fuck live jasmin | 39 |
| 9 | 10 | "ok but where the hell was it released?you all... | ok hell released copy article pther anyone fuc... | 165 |


### 3.2. Data visualization

In [94]:
wordcloud = WordCloud(background_color="white", max_words=200, width=400, height=400, random_state=1).generate(' '.join(df['ProcessedComment']))

plt.figure(figsize=(8, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

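AttributeError: 'TransposedFont' object has no attribute 'getbbox'

This error typically points to a version mismatch between wordcloud and Pillow rather than a problem in the cell itself; upgrading Pillow usually resolves it.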

## 4. Feature engineering

### 4.1. Text vectorization

In [78]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=1000)
X = tfidf.fit_transform(df['ProcessedComment'])
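
To illustrate what the TF-IDF weights represent, here is a minimal pure-Python sketch of the weighting that scikit-learn documents for `TfidfVectorizer` with default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It is a simplification under stated assumptions: tokens are split on whitespace (scikit-learn uses its own token pattern), `max_features` is not modeled, the helper name `tfidf_matrix` is hypothetical, and the third example document is made up.

```python
import math
from collections import Counter

# The first two strings mirror processed comments shown earlier;
# the third is a hypothetical example added to create term overlap.
docs = ["like tribe fan", "chance kid wrong", "real fan wrong"]


def tfidf_matrix(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]  # assumption: whitespace tokens
    vocab = sorted({token for doc in tokenized for token in doc})
    n_docs = len(docs)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter(token for doc in tokenized for token in set(doc))

    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


vocab, rows = tfidf_matrix(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

Terms that occur in several comments, such as "fan" here, receive a lower weight than terms unique to one comment, which is what makes TF-IDF useful for highlighting distinctive vocabulary.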
