# <p style="text-align:center;"> AI for Writing </p>
![](https://static.scientificamerican.com/sciam/cache/file/C9A31747-FBED-41A5-8820ED507C404BB0_source.jpg)

# Develop a web application that accurately calculates the toxicity of a statement provided as input by the user

# Introduction
Online forums and social media platforms let individuals put forward their thoughts and freely express their opinions on various issues and incidents. In some cases, these comments contain explicit language that may hurt readers. Such comments can be classified into categories such as Toxic, Severe Toxic, Obscene, Threat, Insult, and Identity Hate. The threat of abuse and harassment means that many people stop expressing themselves and give up on seeking different opinions.
To protect users from being exposed to offensive language on online forums or social media sites, companies have started flagging comments and blocking users who are found guilty of using unpleasant language. Several Machine Learning models have been developed and deployed to filter out the unruly language and protect internet users from becoming victims of online harassment and cyberbullying.

We will build a model that is capable of detecting different types of toxicity, such as threats, obscenity, insults, and identity-based hate, better than Perspective’s current models.

# Our problem is divided into 2 main phases:
* Modeling
* Deployment to a website

# Modeling

# Problem Statement
* “To build a multi-headed model that’s capable of detecting different types of toxicity like threats, obscenity, insults, and identity-based hate.”

# Library

In [58]:
import numpy as np
from tqdm import tqdm_notebook
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from tqdm import tqdm
tqdm.pandas()
from wordcloud import STOPWORDS
from plotly.subplots import make_subplots
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from scipy.sparse import hstack
import re

from nltk.corpus import stopwords

from tensorflow.keras import regularizers, initializers, optimizers, callbacks
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical  # same tensorflow.keras namespace as the other imports
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model

In [59]:
train = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/train.csv.zip')
test = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/test.csv.zip')
test_labels = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/test_labels.csv.zip')
submission = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/sample_submission.csv.zip')
train.head()

In [60]:
print(train.shape)
train.isnull().sum()

# Exploratory Data Analysis

In [61]:
train['characters length'] = train['comment_text'].apply(len)
train['words length'] = train['comment_text'].apply(lambda x: len(x.split()))
test['characters length'] = test['comment_text'].apply(len)
test['words length'] = test['comment_text'].apply(lambda x: len(x.split()))

# Characters Length Distribution Train

In [62]:
print(train['characters length'].describe())
fig = px.histogram(train, x='characters length', marginal='box')
fig.show()

# Characters Length Distribution Test

In [63]:
print(test['characters length'].describe())
fig = px.histogram(test, x='characters length', marginal='box')
fig.show()

In [64]:
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values
list_sentences_train = train["comment_text"]
list_sentences_test = test["comment_text"]

In [65]:
max_features = 20000
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)
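
To make the tokenization step above concrete, here is a rough pure-Python sketch of the idea: words are ranked by frequency, and only ids below `num_words` are kept when texts are converted to sequences. The helper names (`fit_word_index`, `texts_to_ids`) and the simplified word-splitting regex are assumptions made for illustration. This is not the Keras implementation.

```python
import re
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; id 1 is the most frequent word (0 is left for padding)."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    # most_common sorts by descending count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index, num_words):
    """Map each text to a list of word ids, dropping unseen words and ids >= num_words."""
    sequences = []
    for text in texts:
        ids = [word_index.get(w) for w in re.findall(r"[a-z0-9']+", text.lower())]
        sequences.append([i for i in ids if i is not None and i < num_words])
    return sequences

toy_texts = ["you are great", "you are rude", "you rock"]
toy_index = fit_word_index(toy_texts)
# "unknown" was never seen and "rude" ranks below the cap, so both are dropped
toy_sequences = texts_to_ids(["you are unknown rude"], toy_index, num_words=3)
print(toy_sequences)
```

With `max_features = 20000`, the same capping means rare words simply disappear from the sequences fed to the model.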

In [66]:
maxlen = 200
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)
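
The padding step can be pictured with a small stand-alone sketch. It assumes the default `pad_sequences` behaviour, which pads and truncates at the front of each sequence. The function name `pad_pre` is hypothetical and the toy ids are made up.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` ids of long ones."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # truncate from the front (assumes maxlen >= 1)
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

toy_padded = pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(toy_padded)
```

Every row ends up with exactly `maxlen` entries, which is what lets the comments be stacked into a single `(n_comments, 200)` array.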

In [67]:
totalNumWords = [len(one_comment) for one_comment in list_tokenized_train]
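
One way to use these per-comment token counts is to check how many comments fit inside `maxlen` without truncation. The sketch below uses a hypothetical helper, `coverage`, and made-up lengths rather than values from the dataset.

```python
def coverage(lengths, maxlen):
    """Fraction of comments whose token count fits within maxlen without truncation."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n <= maxlen) / len(lengths)

toy_lengths = [12, 40, 75, 180, 950]  # illustrative token counts, not taken from the dataset
toy_coverage = coverage(toy_lengths, maxlen=200)
print(f"{toy_coverage:.0%} of comments fit in 200 tokens")
```

In the notebook, `coverage(totalNumWords, maxlen)` would give the corresponding figure for the training comments.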

# Finally, building the model
**This is the architecture of the model we are trying to build. It's always a good idea to list out the dimensions of each layer in the model; it helps you think visually and debug later on.**
![](https://i.imgur.com/txJomEa.png)

# Model Creation (LSTM)
It is now time to choose a deep-learning model and train it on the training set, holding out part of it for validation. Since this is a Natural Language Processing use case, a Long Short-Term Memory (LSTM) network is a natural choice. LSTMs are recurrent neural networks (RNNs) in which the plain hidden-layer update is replaced by gated memory cells. This makes them better at finding and exploiting long-range dependencies in data, which matters for sentence structure.
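
To illustrate what a memory cell does, here is a minimal NumPy sketch of a single LSTM time step. The weight layout, the gate ordering, and the toy sizes are choices made for this example. It is meant to show the mechanism, not to reproduce the Keras `LSTM` layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4*units, input_dim), U: (4*units, units), b: (4*units,)."""
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * units:1 * units])  # input gate: how much new information to write
    f = sigmoid(z[1 * units:2 * units])  # forget gate: how much of the old cell to keep
    g = np.tanh(z[2 * units:3 * units])  # candidate values for the memory cell
    o = sigmoid(z[3 * units:4 * units])  # output gate: how much of the cell to expose
    c_t = f * c_prev + i * g             # memory cell update
    h_t = o * np.tanh(c_t)               # hidden state passed to the next step
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, units = 3, 2                  # toy sizes; the notebook uses 128 and 60
W = rng.normal(size=(4 * units, input_dim))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)
h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):  # a toy sequence of 5 time steps
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

Because the cell state `c_t` is updated additively and gated by `f`, information from early tokens can survive many steps, which is the property the paragraph above refers to.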

In [68]:
inp = Input(shape=(maxlen, )) #maxlen=200 as defined earlier
embed_size = 128
x = Embedding(max_features, embed_size)(inp)

In [69]:
x = LSTM(60, return_sequences=True,name='lstm_layer')(x)
x = GlobalMaxPool1D()(x)
x = Dropout(0.1)(x)
x = Dense(50, activation="relu")(x)
#x = Dense(50, activation="sigmoid")(x)  # gave lower performance than relu
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)

In [70]:
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
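
The combination of six sigmoid outputs and `binary_crossentropy` treats each label as its own yes/no question, so one comment can carry several labels at once. Here is a small pure-Python sketch of that loss for a single comment. The predicted probabilities are made up for illustration.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over independent labels for one comment."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Label order follows list_classes: toxic, severe_toxic, obscene, threat, insult, identity_hate
y_true = [1, 0, 1, 0, 1, 0]  # toxic, obscene and insult are all set for this comment
good = binary_crossentropy(y_true, [0.9, 0.1, 0.8, 0.1, 0.7, 0.1])
bad = binary_crossentropy(y_true, [0.1, 0.9, 0.2, 0.9, 0.3, 0.9])
print(good < bad)
```

A softmax output would instead force the six scores to sum to one, which does not fit comments that belong to several categories.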

In [71]:
batch_size = 32
epochs = 4
model.fit(X_t,y, batch_size=batch_size, epochs=epochs, validation_split=0.1)

In [72]:
model.summary()
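
As a cross-check on the summary, the output shapes and parameter counts of the layers defined above can be worked out by hand. The sketch below uses the standard per-layer formulas and the hyperparameters from the earlier cells. The Dropout layers are left out because they have no parameters and do not change shapes.

```python
max_features, embed_size, maxlen = 20000, 128, 200  # values from the cells above
lstm_units, dense_units, n_classes = 60, 50, 6

layers = [
    ("Embedding", (maxlen, embed_size), max_features * embed_size),
    # 4 gates, each with input weights, recurrent weights and a bias
    ("LSTM", (maxlen, lstm_units), 4 * (lstm_units * (embed_size + lstm_units) + lstm_units)),
    ("GlobalMaxPool1D", (lstm_units,), 0),
    ("Dense(50)", (dense_units,), lstm_units * dense_units + dense_units),
    ("Dense(6)", (n_classes,), dense_units * n_classes + n_classes),
]
for name, shape, params in layers:
    print(f"{name:<16} output (batch, {', '.join(map(str, shape))})  params {params:,}")
total_params = sum(params for _, _, params in layers)
print(f"total {total_params:,}")
```

Almost all of the parameters sit in the embedding table, so `max_features` and `embed_size` are the main levers on model size.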

In [73]:
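# y_train and y_val were never defined; validation_split=0.1 holds out the last 10% of rows
split_at = int(len(y) * (1 - 0.1))
y_train, y_val = y[:split_at], y[split_at:]
print('Number of entries in each category:')
print('training: ', y_train.sum(axis=0))
print('validation: ', y_val.sum(axis=0))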

# Part 2 of the Toxic Comment Classifier project
* This part of the notebook walks through the steps needed to deploy a deep learning model on AWS EC2.
![](https://miro.medium.com/max/1002/1*tbmc4Zub9udxVp1twG43qQ.png)

# Problem Statement
**To develop a web application which accurately calculates the toxicity of a statement that has been provided as an input by the user.**
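
The notebook does not include the application code itself, so the following is only a sketch of what such a web endpoint might look like, using the standard-library WSGI interface. `predict_scores` is a hypothetical stand-in for the trained model, and the `text` query parameter and JSON response shape are assumptions.

```python
import json
from urllib.parse import parse_qs

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def predict_scores(text):
    """Hypothetical stand-in for the trained model: returns one score per label.

    A real deployment would tokenize and pad `text` as in training
    and call model.predict on the result.
    """
    return [0.0] * len(LABELS)

def app(environ, start_response):
    """WSGI app: GET /?text=... returns a JSON object mapping label -> score."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    text = query.get("text", [""])[0].strip()
    if not text:
        status, payload = "400 Bad Request", {"error": "missing 'text' parameter"}
    else:
        status, payload = "200 OK", dict(zip(LABELS, map(float, predict_scores(text))))
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]

def call(query_string):
    """Invoke the app directly, without a network socket, and decode the response."""
    captured = {}
    def start_response(status, headers):
        captured["status"] = status
    body = b"".join(app({"QUERY_STRING": query_string}, start_response))
    return captured["status"], json.loads(body)

status, result = call("text=have+a+nice+day")
print(status, result)
```

To serve it for a quick local test, `wsgiref.simple_server.make_server("0.0.0.0", 8000, app).serve_forever()` from the standard library would be enough. The port number is an arbitrary choice.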

**Having generated all the files required for deployment, I downloaded the applications needed to complete it, created an account on AWS, and finally deployed my application on an AWS EC2 instance. Each step I followed to deploy the LSTM model is described below:**

**1 - Create an account on Amazon Web Services and log in. Once logged in, search for EC2 in the search bar at the top of the AWS Management Console. Choosing “EC2” takes you to the EC2 management console, where you can click on “Instances” in the Resources tab.**
![](https://miro.medium.com/max/1400/1*PRh9JVfUMlI_3f36WpKTbA.png)


**There are other steps to finish the deployment, but the AWS platform guides you through them. For more details, follow this link:**
[full guide](https://aws.amazon.com/getting-started/hands-on/build-train-deploy-machine-learning-model-sagemaker/)

# Conclusion
**We deployed the LSTM model on a cloud platform, AWS EC2.
In this project we built a model that detects toxic language in sentences. The application (web-only for now) could be extended into a Google app serving use cases such as child protection or anti-bullying on social media.**

# Database
* [toxic_database](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge)
* [glove](https://www.kaggle.com/authman/pickled-glove840b300d-for-10sec-loading)