# MALIGNANT COMMENTS CLASSIFICATION:

Data Set Description: The data set consists of a training set with approximately 159,000 samples and a test set with nearly 153,000 samples. Each training sample has 8 fields: ‘Id’, ‘Comments’, ‘Malignant’, ‘Highly malignant’, ‘Rude’, ‘Threat’, ‘Abuse’ and ‘Loathe’.
Each label is either 0 or 1, where 0 denotes NO and 1 denotes YES. A comment can carry multiple labels. The first attribute is a unique ID associated with each comment.

In [None]:
# importing all required libraries for the project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from scipy import stats
import nltk
from nltk.corpus import stopwords
from nltk import FreqDist
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem import wordnet
from nltk.corpus import wordnet as wn
import re
import string
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [None]:
# lets import train csv file to jupyter notebook
df=pd.read_csv("train.csv")
df

In [None]:
# lets import test csv file to jupyter notebook
df_test=pd.read_csv("test.csv")
df_test

# EDA (Exploratory Data Analysis):

In [None]:
# lets check the names of the columns present in the train dataset
df.columns

  - Here we can see the names of all the columns present in our train dataset, with malignant as our main target column alongside the five other label columns.


In [None]:
# lets check the names of the columns present in the test dataset
df_test.columns

  - Here we can see the names of the columns present in our test dataset

In [None]:
# lets check shape of the train dataset
df.shape

  - Here we can see that the train dataset has 159,571 rows and 8 columns.

In [None]:
# lets check the shape of the test dataset
df_test.shape

  - Here we can see that the test dataset has 153,164 rows and 2 columns.

In [None]:
# lets check the information about the train dataset
df.info()

  - Here we can see that there are no Null values present in our train dataset

In [None]:
# lets check the information regarding the test dataset
df_test.info()

  - Here we can see that there are no null values present in the test dataset as well.

In [None]:
# lets check the value counts of the train dataset
# (on a whole DataFrame, value_counts() counts unique rows; per-column counts follow in the next cells)
df.value_counts()

In [None]:
df['malignant'].value_counts()

  - Here 0 denotes No and 1 denotes Yes, so most of the messages are not malignant.

In [None]:
df['highly_malignant'].value_counts()

  - Here also we can see very few messages are Highly Malignant.

In [None]:
df['rude'].value_counts()

  - Few of the messages are rude

In [None]:
df['threat'].value_counts()

  - Here we can see very few messages have threat content.

In [None]:
df['abuse'].value_counts()

  - Here we can see few messages have abusive language.

In [None]:
df['loathe'].value_counts()

  - Here we can see that few messages contain loathing or disgusting language.
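
The raw counts above are easier to compare as proportions. A minimal sketch on an invented binary column (the values are illustrative, not taken from the dataset); `value_counts(normalize=True)` returns the share of each class:

```python
import pandas as pd

# Invented binary label column: 9 negatives and 1 positive.
toy_label = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], name="malignant")

# normalize=True returns fractions instead of raw counts.
shares = toy_label.value_counts(normalize=True)
print(shares.loc[0], shares.loc[1])  # 0.9 0.1
```

On the real data the same call could be applied to each label column, for example `df['threat'].value_counts(normalize=True)`.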

In [None]:
# lets check the datatype of all the columns present in the train dataset
df.dtypes

  - Here we can see that there are two dtypes present in the train dataset: object and integer.
  - The first column, id, is unique for every comment, so it carries no predictive information and would only add noise to the model. We therefore drop this column.

In [None]:
# lets drop the column id from train dataset
df.drop('id',axis=1,inplace=True)

  - Here we have successfully dropped the column id from our train dataset

In [None]:
# lets check the presence of null value once again in train dataset
df.isnull().sum()

  - This confirms again that there are no null values present in the train dataset.

In [None]:
# lets check few comments present in the train dataset
df['comment_text'][9]

In [None]:
df['comment_text'][27]

In [None]:
df['comment_text'][5]

In [None]:
df['comment_text'][117]

  - After looking at some comments, we can clearly see that text processing is needed: the comments contain many numbers, special characters and other tokens that are not useful for our model.

In [None]:
# lets create a new column showing the length (number of characters) of comment_text in train dataset
df['before_clean']=df['comment_text'].str.len()
df


In [None]:
# lets create a new column named before_clean showing the number of characters in comment_text in test dataset
df_test['before_clean']=df_test['comment_text'].str.len()
df_test
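
Note that `len()` on a string counts characters, not words. A small sketch on an invented comment shows the difference; a word count would need a split first:

```python
comment = "You are wrong, sir!"  # invented example

n_chars = len(comment)          # characters, including spaces and punctuation
n_words = len(comment.split())  # whitespace-separated tokens

print(n_chars, n_words)  # 19 4
```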

# Text Processing:

In [None]:
# lets download latest updated stopwords and wordnet.
import nltk
nltk.download('wordnet')

In [None]:
nltk.download('stopwords')

In [None]:
import nltk
nltk.download('punkt')

In [None]:
stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
lemmatizer = WordNetLemmatizer()

# lets clean the messages and remove or replace some words
def edited(text):
    # Convert to lower case
    text = text.lower()

    # Replace email addresses with 'emailaddress' (unanchored, so addresses inside a comment match)
    text = re.sub(r'\S+@\S+\.[a-z]{2,}', 'emailaddress', text)

    # Replace URLs with 'webaddress'
    text = re.sub(r'http\S+', 'webaddress', text)

    # Remove numbers
    text = re.sub(r'[0-9]', ' ', text)

    # Remove HTML tags
    text = re.sub(r'<.*?>', ' ', text)

    # Remove punctuation and underscores
    text = re.sub(r'[^\w\s]|_', ' ', text)

    # Remove all non-ASCII characters
    text = re.sub(r'[^\x00-\x7f]', '', text)

    # Collapse newlines and repeated whitespace into single spaces
    text = ' '.join(text.split())

    # Split the text into words
    tokenized_text = word_tokenize(text)

    # Keep alphabetic tokens that are not stop words, and lemmatize them
    cleaned_tokens = [lemmatizer.lemmatize(word) for word in tokenized_text
                      if word.isalpha() and word not in stop_words]

    return ' '.join(cleaned_tokens)
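
How the email pattern is anchored matters: a pattern wrapped in `^...$` can only match when the entire comment forms one address-like string. A minimal sketch comparing it with an unanchored variant; the unanchored pattern and the sample strings are illustrative assumptions:

```python
import re

anchored = r'^.+@[^\.].*\.[a-z]{2,}$'   # must match the whole string
unanchored = r'\S+@\S+\.[a-z]{2,}'      # assumed alternative: matches inside a sentence

s1 = "contact john.doe@example.com"
s2 = "write to a@b.com today"

# The anchored pattern swallows the whole comment, or matches nothing at all.
a1 = re.sub(anchored, 'emailaddress', s1)
a2 = re.sub(anchored, 'emailaddress', s2)

# The unanchored pattern replaces only the address itself.
u1 = re.sub(unanchored, 'emailaddress', s1)
u2 = re.sub(unanchored, 'emailaddress', s2)

print(a1)  # emailaddress
print(a2)  # write to a@b.com today
print(u1)  # contact emailaddress
print(u2)  # write to emailaddress today
```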

In [None]:
#Calling the above function for the column comment_text in training dataset to replace original with cleaned text
df['comment_text'] = df['comment_text'].apply(edited)
df['comment_text']

In [None]:
#Creating a column 'len_after_cleaning'
#holding the length (number of characters) of each comment in 'comment_text' after cleaning the text.
df['len_after_cleaning'] = df['comment_text'].map(lambda comment_text: len(comment_text))
df

In [None]:
# lets import wordcloud to jupyter notebook
!pip install wordcloud

In [None]:
import wordcloud
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

In [None]:
def wcloud(df, label):
    
    # lets print only rows where the label value is 1 (ie. where comment is harsh)
    subset=df[df[label]==1]
    text=subset.comment_text.values
    wc= WordCloud(background_color="black",max_words=4500)

    wc.generate(" ".join(text))

    plt.figure(figsize=(27,27))
    plt.subplot(221)
    plt.axis("off")
    plt.title("Words frequented in {}".format(label), fontsize=18)
    plt.imshow(wc.recolor(colormap= 'gist_earth' , random_state=244))
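
A word cloud is essentially a picture of word frequencies, so the same information can be checked numerically. A rough sketch with invented cleaned comments, using only `collections.Counter` (WordCloud itself may also group word pairs by default, so its sizes need not match these counts exactly):

```python
from collections import Counter

# Invented cleaned comments standing in for subset.comment_text.values
cleaned_comments = ["bad bad comment", "bad word", "nice comment"]

# Join all comments and count each whitespace-separated token.
freq = Counter(" ".join(cleaned_comments).split())
print(freq.most_common(2))  # [('bad', 3), ('comment', 2)]
```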

In [None]:
df_m=df.loc[:,['comment_text','malignant']]
wcloud(df_m,'malignant')

In [None]:
df_hm=df.loc[:,['comment_text','highly_malignant']]
wcloud(df_hm,'highly_malignant')

In [None]:
df_r=df.loc[:,['comment_text','rude']]
wcloud(df_r,'rude')

In [None]:
df_t=df.loc[:,['comment_text','threat']]
wcloud(df_t,'threat')

In [None]:
df_a=df.loc[:,['comment_text','abuse']]
wcloud(df_a,'abuse')

In [None]:
df_l=df.loc[:,['comment_text','loathe']]
wcloud(df_l,'loathe')

# Visualization:

In [None]:
# lets plot all features using countplot
feat=df.columns[1:]
for col in feat:
    sns.countplot(x=df[col])  # pass the column as x; recent seaborn versions require keyword arguments
    plt.show()

  - In the first graph (malignant) we can clearly observe that most of the messages are not malignant.
  - In the second graph we can clearly observe that there are very few highly malignant messages.
  - Likewise, the third graph shows few rude comments in the dataset.
  - In the fourth graph we can clearly see that threat comments are very rare, almost negligible.
  - In the fifth graph we can clearly see that there are some messages with abusive language.
  - In the sixth graph we can clearly see that there are very few loathe messages.
  - The seventh graph shows the length of each comment before cleaning.
  - The eighth graph shows the length of each comment after cleaning.

In [None]:
# lets create a list of feature columns
featu=['malignant','highly_malignant','rude','threat','abuse','loathe']

In [None]:
# lets store the no. of counts for every target
counts=df[featu].sum()
counts

In [None]:
# lets plot and visualize count of each columns
plt.figure(figsize=(18,9))
ax=sns.barplot(x=counts.index,y=counts.values)
plt.title("Total no. of messages in each columns")
plt.ylabel('Freq', fontsize=9)
plt.xlabel('Columns',fontsize=9)
rects=ax.patches
labels=counts.values
for rect, label in zip(rects, labels):
    height=rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center',va='bottom' )
plt.show()

  - Here we can clearly see that the malignant category has the most messages, followed by rude and abuse.

In [None]:
#lets check the distribution of the label columns using histograms (distplot is deprecated in seaborn)
for col in featu:
    sns.histplot(df[col])
    plt.show()

  - Here we can see that the data is right-skewed in all the label columns: most values are 0 and only a few are 1.

In [None]:
# lets check the statistical description of all the columns
df.describe()

  - Here we can see that only 2 values are present in all the label columns, i.e. 0 and 1.
  - The low standard deviation tells us that the data is not widely spread.
  - The mean differs from the median, which tells us that some skewness is present.
  - The difference between the 75% quantile and the max is at most 1 in the label columns, so there are no outliers in them.
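
For a 0/1 column these statistics have a direct reading: the mean is the share of 1s, and a mean above the median means a tail towards 1. A minimal sketch on an invented column with 10% positives:

```python
import pandas as pd

toy = pd.Series([0] * 9 + [1])  # invented binary column, 10% positives

share_of_ones = toy.mean()   # for 0/1 data the mean is the fraction of 1s
middle_value = toy.median()  # 0.0, because more than half of the values are 0
skewness = toy.skew()        # positive value: right skew

print(round(share_of_ones, 2), middle_value, skewness > 0)  # 0.1 0.0 True
```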


In [None]:
# lets check the correlation among all the numeric columns
df.corr(numeric_only=True)

In [None]:
# lets visualize correlation using heatmap
plt.figure(figsize=(9,8))
sns.heatmap(df.corr(numeric_only=True),linewidth=0.5, linecolor='black',fmt='.0%',annot=True)
plt.show()

# Data Pre-Processing

In [None]:
# lets create label column in train dataset
c_label= ['malignant','highly_malignant','rude','threat','abuse','loathe']
df[c_label].sum()

In [None]:
df['label']=df[c_label].sum(axis=1)
df.head()
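
The `label` column is the row-wise sum of the six flags, so it ranges from 0 (clean comment) to 6 (every category flagged). A small sketch on invented rows; the `any_flag` column is a hypothetical binary variant shown only for comparison and is not used in this notebook:

```python
import pandas as pd

c_label = ['malignant', 'highly_malignant', 'rude', 'threat', 'abuse', 'loathe']

# Invented rows: a clean comment, one with three flags, one with a single flag.
toy = pd.DataFrame(
    [[0, 0, 0, 0, 0, 0],
     [1, 0, 1, 0, 1, 0],
     [1, 0, 0, 0, 0, 0]],
    columns=c_label,
)

toy['label'] = toy[c_label].sum(axis=1)           # number of categories flagged per comment
toy['any_flag'] = (toy['label'] > 0).astype(int)  # hypothetical: 1 if any category is flagged

print(toy['label'].tolist())     # [0, 3, 1]
print(toy['any_flag'].tolist())  # [0, 1, 1]
```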

In [None]:
#lets Check the count of labels
plt.figure(figsize=(10,6))
sns.countplot(x=df['label'], palette='coolwarm')
plt.title('Counting the labels',fontsize=27)
plt.show()

# Model building:

## Vectorizer

In [None]:
# lets convert text data using TfidfVectorizer
# lets import library for vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
tfidf=TfidfVectorizer(max_features = 14000, stop_words='english')

In [None]:
#Let's separate the input and output variables, X and y, in the train data
#X is the TF-IDF matrix of the cleaned comments
X = tfidf.fit_transform(df['comment_text'])

In [None]:
# the output variable y is the combined label column
y=df['label']

In [None]:
# lets check the shape of the dataset
print(X.shape,'\t\t',y.shape)

In [None]:
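#Doing the above process for test data
#Apply the same cleaning, then reuse the vectorizer fitted on the train data (transform, not fit_transform)
df_test['comment_text'] = df_test['comment_text'].apply(edited)
test_vec = tfidf.transform(df_test['comment_text'])
test_vec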

In [None]:
test_vec.shape
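
For the model's feature columns to line up, the test text has to be mapped with the vocabulary learned from the training text. A minimal sketch with invented sentences: `fit_transform` learns the vocabulary and IDF weights, while `transform` only reuses them, so both matrices end up with the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["bad comment here", "nice comment"]  # invented
test_texts = ["bad unseen words"]                   # invented

vec = TfidfVectorizer()
X_train_toy = vec.fit_transform(train_texts)  # learns vocabulary: bad, comment, here, nice
X_test_toy = vec.transform(test_texts)        # reuses it; unseen words are ignored

print(X_train_toy.shape, X_test_toy.shape)  # (2, 4) (1, 4)
```

Calling `fit_transform` again on the test text would build a different vocabulary, and the columns would no longer correspond to the features the model was trained on.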

In [None]:
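# NOTE: the columns 'length', 'exclamation' and 'question' are not created anywhere in this notebook,
# so each list is built only if its column exists; the lists are not used in the cells below.
length = [[i] for i in df['length']] if 'length' in df.columns else []
exclamation = [[i] for i in df['exclamation']] if 'exclamation' in df.columns else []
question = [[i] for i in df['question']] if 'question' in df.columns else []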

# Building the model:

In [None]:
#Splitting the training and testing data 
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42, stratify=y)
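
With imbalanced labels, `stratify=y` keeps the class proportions the same in both parts of the split. A small sketch on invented data with an 80/20 class ratio:

```python
from sklearn.model_selection import train_test_split

# Invented data: 80 samples of class 0 and 20 samples of class 1.
y_toy = [0] * 80 + [1] * 20
X_toy = [[i] for i in range(100)]

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42, stratify=y_toy
)

# Both parts keep the 20% share of class 1.
print(y_tr.count(1) / len(y_tr), y_te.count(1) / len(y_te))  # 0.2 0.2
```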

In [None]:
#Checking the shape of x data
print(x_train.shape,'\t\t',x_test.shape)

In [None]:
#Checking the shape of y data
print(y_train.shape,'\t',y_test.shape)

# Model Selection:

In [None]:
#Importing required libraries
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split, GridSearchCV 
from sklearn.metrics import f1_score,precision_score, multilabel_confusion_matrix, accuracy_score,jaccard_score, recall_score, hamming_loss
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_score


In [None]:
#Initializing the instance of the model
svc = LinearSVC()
lr = LogisticRegression(solver='lbfgs')
mnb = MultinomialNB()
lgb = LGBMClassifier()
sgd = SGDClassifier()
rf = RandomForestClassifier()

In [None]:
def print_score(y_pred,clf):
    print('classifier:',clf.__class__.__name__)
    print("Jaccard score: {}".format(jaccard_score(y_test,y_pred,average='micro')))
    print("Accuracy score: {}".format(accuracy_score(y_test,y_pred)))
    print("f1_score: {}".format(f1_score(y_test,y_pred,average='micro')))
    print("Precision : ", precision_score(y_test,y_pred,average='micro'))
    print("Recall: {}".format(recall_score(y_test,y_pred,average='micro')))
    print("Hamming loss: ", hamming_loss(y_test,y_pred))
    print("Confusion matrix:\n ", multilabel_confusion_matrix(y_test,y_pred))
    print('========================================\n')    
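
When the target is a single column (as with `y = df['label']`), the micro-averaged scores printed above are closely related. A hand computation on invented predictions, assuming single-label targets, sketches why: every wrong prediction counts once as a false positive and once as a false negative.

```python
y_true = [0, 0, 1, 2, 1, 0]  # invented true labels
y_hat = [0, 1, 1, 2, 0, 0]   # invented predictions

correct = sum(t == p for t, p in zip(y_true, y_hat))
wrong = len(y_true) - correct

accuracy = correct / len(y_true)
# Micro precision, recall and F1: TP = correct, FP = FN = wrong, so all equal accuracy.
micro_precision = correct / (correct + wrong)
# Micro Jaccard: TP / (TP + FP + FN)
micro_jaccard = correct / (correct + 2 * wrong)
# Hamming loss: fraction of wrongly predicted labels
hamming = wrong / len(y_true)

print(correct, wrong)  # 4 2
print(round(accuracy, 3), round(micro_jaccard, 3), round(hamming, 3))  # 0.667 0.5 0.333
```

This is why accuracy, micro precision, micro recall and micro F1 show the same number for each classifier in this setting.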

In [None]:
#models with evaluation using OneVsRestClassifier
for classifier in [svc,lr,mnb,sgd,lgb,rf]:
   clf = OneVsRestClassifier(classifier)
   clf.fit(x_train,y_train)
   y_pred = clf.predict(x_test)
   print_score(y_pred, classifier)

  - LinearSVC is our best model, so we will perform hyperparameter tuning on it and try to increase its accuracy score.

# Hyper parameter tuning:

In [None]:
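#Creating parameter list to pass in GridSearchCV
# Note: LinearSVC does not support penalty='l1' together with loss='hinge', so those candidates fail to fit,
# and with multi_class='crammer_singer' the penalty, loss and dual settings are ignored.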
param = {
        'estimator__penalty': ['l1'],
        'estimator__loss': ['hinge','squared_hinge'],
        'estimator__multi_class': ['ovr','crammer_singer'],
        'estimator__dual': [False],
        'estimator__intercept_scaling': [2,4,5],
        'estimator__C': [2]
        }

In [None]:
from sklearn.model_selection import GridSearchCV
svc = OneVsRestClassifier(LinearSVC())
GCV =  GridSearchCV(svc,param,cv = 3, verbose =0,n_jobs=-1)
GCV.fit(x_train,y_train)
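
The cost of the grid search follows from the size of the grid: every combination of values is one candidate, and each candidate is fitted once per fold. A quick count with `itertools.product`, using the same grid and `cv = 3`:

```python
from itertools import product

param = {
    'estimator__penalty': ['l1'],
    'estimator__loss': ['hinge', 'squared_hinge'],
    'estimator__multi_class': ['ovr', 'crammer_singer'],
    'estimator__dual': [False],
    'estimator__intercept_scaling': [2, 4, 5],
    'estimator__C': [2],
}

# One candidate per combination of values: 1 * 2 * 2 * 1 * 3 * 1
n_candidates = len(list(product(*param.values())))
n_cv_fits = n_candidates * 3  # cv = 3 folds per candidate

print(n_candidates, n_cv_fits)  # 12 36
```

By default GridSearchCV then refits the best candidate once more on the whole training split.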

In [None]:
GCV.best_params_

# Final Model:

In [None]:
model = OneVsRestClassifier(LinearSVC(C=2,dual = False, loss='hinge',multi_class='crammer_singer', penalty ='l1',intercept_scaling=2))
model.fit(x_train,y_train)
y_pred = model.predict(x_test)

print("Jaccard score: {}".format(jaccard_score(y_test,y_pred,average='micro')))
print("Accuracy score: {}".format(accuracy_score(y_test,y_pred)))
print("f1_score: {}".format(f1_score(y_test,y_pred,average='micro')))
print("Precision : ", precision_score(y_test,y_pred,average='micro'))
print("Recall: {}".format(recall_score(y_test,y_pred,average='micro')))
print("Hamming loss: ", hamming_loss(y_test,y_pred))
print("\nConfusion matrix: \n", multilabel_confusion_matrix(y_test,y_pred))


  - Tuning improved the prediction score only slightly, from 91.76% to 91.77%.

In [None]:
lsvc_prediction=model.predict(X)
#Making a dataframe of predictions
malignant_prediction=pd.DataFrame({'Predictions':lsvc_prediction})
malignant_prediction

# Saving Our Best Model:

In [None]:
#Saving the model
import pickle
filename='MalignantCommentsClassifier.pkl'
# the with-block closes the file once the model is written
with open(filename,'wb') as f:
    pickle.dump(model,f)

In [None]:
#Checking our vectorized test data
test_vec

In [None]:
#Loading the model
with open('MalignantCommentsClassifier.pkl','rb') as f:
    fitted_model=pickle.load(f)
fitted_model
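
A saved model is only usable later together with the vectorizer that produced its features. A minimal round-trip sketch, assuming both are stored in one file; the placeholder strings stand in for the fitted objects and the file name is hypothetical:

```python
import os
import pickle
import tempfile

# Placeholders standing in for the fitted model and the fitted TfidfVectorizer.
artifacts = {"model": "fitted-model-placeholder", "vectorizer": "fitted-tfidf-placeholder"}

path = os.path.join(tempfile.mkdtemp(), "artifacts.pkl")  # hypothetical file name

# Write and read inside with-blocks so the file handles are always closed.
with open(path, "wb") as fh:
    pickle.dump(artifacts, fh)
with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored == artifacts)  # True
```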

# Prediction using test dataset:

In [None]:
#Train predictions
malignant_prediction.to_csv('Malignant_TrainDataPredictions.csv')

In [None]:
#Lets load the test data set
test = pd.read_csv('test.csv')
test.head()

In [None]:
#Predictions
test_prediction=model.predict(test_vec)
test_df=pd.DataFrame({'Predictions':test_prediction})
test_df

In [None]:
# lets save the predictions
test_results=pd.DataFrame(test_df)
test_results.to_csv('Malignant_TestDataPredictions.csv')

# saving the predictions

In [None]:
malignant_prediction.to_csv('Malignant_DataPredictions.csv')

  - Finally, we predicted on the test data and saved the predictions to a CSV file.