In [None]:
from IPython.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
    horizontal-align: middle;
}
h1,h2 {
    text-align: center;
    background-color: pink;
    padding: 20px;
    margin: 0;
    color: white;
    font-family: Arial;
    border-radius: 80px
}

h3 {
    text-align: center;
    border-style: solid;
    border-width: 3px;
    padding: 12px;
    margin: 0;
    color: black;
    font-family: Arial;
    border-radius: 80px;
    border-color: gold;
}

body, p {
    font-family: Arial;
    font-size: 15px;
    color: #36454f;
}
div {
    font-size: 14px;
    margin: 0;

}

h4 {
    padding: 0px;
    margin: 0;
    font-family: Arial;
    color: purple;
}
</style>
""")

# Detailed EDA and using TF-IDF with Logistic Regression

### (: <span style="color:green">WELCOME</span> :)

![jig](https://user-images.githubusercontent.com/74188336/142691516-4b0fee38-6c8b-4204-8b1f-c1d8d1144161.jpeg)

### Overview:

This notebook visualises the text data to identify patterns that might help us differentiate between `less_toxic` and `more_toxic` comments.

I have used a custom dataset that I built from the previous Jigsaw Toxic Comment Classification Challenge. How the dataset was made is detailed in this discussion.

I won't be training any deep learning model in this notebook. But if you want to train one, you can use any of my notebooks below as a reference.

I have used two different techniques:
* Proposed by @debarshichanda: using a ranking loss function to train a transformer. [🎃 BERT | FIT | ES and Visualisation 📈](https://www.kaggle.com/kishalmandal/bert-fit-es-and-visualisation) can be used to train that model, and [JRSTC | BERT | INFER 🎃](https://www.kaggle.com/kishalmandal/jrstc-bert-infer) is the inference kernel. You only need to change the model name. There is no need to change `hidden_nodes` every time you switch between base and large models, since the transformer head uses an `nn.LazyLinear()` layer.

* Used the Toxic Comment Classification dataset to train a multi-headed (6 heads) model that classifies the comments into those 6 categories, then summed the output probabilities and used the sums to rank the comments. You can find the training kernel [here](https://www.kaggle.com/kishalmandal/jigsaw-fit-multi-label-comment-classifier) and the inference kernel [here](https://www.kaggle.com/kishalmandal/infer-toxiccomments).

Now back to this notebook :)

This notebook focuses on visualisation: n-gram plots (uni-grams, bi-grams, tri-grams), box plots and distribution plots. It then discusses the TF-IDF vectorizer in detail, uses it to train a Logistic Regression model (5 folds), and makes a submission.


## Please <span style="color:white">upvote</span>  if it helps you or if you like it :) It's free to <span style="color:white">upvote</span> :) <span style="color:white">:)</span> :)

# : <span style="color:white">Contents</span> :

### 1. Some basic pre-processing    
### 2. n-grams (with visualizations) 
### 3. WordClouds 😃
### 4. Box Plots 
### 5. Distribution plots
### 6. Trying another approach
### 7. TF-<span style="color:purple">IDF</span>
### 8. Training using Logistic Regression Model
### 9. Visualization of `more_toxic` words
### 10. Calculating score on `validation_data.csv`
### 11. <span style="color:purple">Maximum</span> score that can be obtained on `validation_data.csv`
### 12. Submission time 🎃

In [None]:
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import STOPWORDS, WordCloud, ImageColorGenerator
from collections import defaultdict
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from PIL import Image
import seaborn as sns
import string
import plotly.figure_factory as ff
import random


In [None]:
df = pd.read_csv('../input/toxic-comments-train/training_data.csv')

In [None]:
df.head()

# 1. Some basic pre-processing:

Although the title says pre-processing, here I have only converted the text to lower case and removed extra spaces.... :D

I am currently working on correcting misspelled words and expanding shortened words...

Need to increase my vocab...never heard of these `toxic` words :D

### 1.1. Converting to lowercase and removing extra spaces

In [None]:
df["less_toxic"] = df['less_toxic'].apply(
    lambda x: ' '.join([w for w in str(x).lower().split()])
)

df["more_toxic"] = df['more_toxic'].apply(
    lambda x: ' '.join([w for w in str(x).lower().split()])
)


In [None]:
df.head()

### 1.2. Expanding shortened words and correcting misspelled words

In [None]:
# Replace the shortened/misspelled forms only when they are whole words, in both columns
for col in ["less_toxic", "more_toxic"]:
    df[col] = df[col].str.replace(r'\b(?:fk|fuk)\b', 'fuck', regex=True)

In [None]:
df.head()

# 2. n-grams

### What are n-grams?

In the fields of computational linguistics and probability, an n-gram (sometimes also called Q-gram) is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application.

#### **n-gram generator**

In [None]:
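def generate_ngrams(text, n_gram=1):
    # Lower-case, split on spaces, and drop empty tokens and stopwords
    tokens = [token for token in text.lower().split(' ') if token != '' and token not in STOPWORDS]
    # Zip n_gram shifted copies of the token list to get consecutive word windows
    ngrams = zip(*[tokens[i:] for i in range(n_gram)])
    return [' '.join(ngram) for ngram in ngrams]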

#### Number `N` of top `n-grams` to visualize

In [None]:
N = 20

### 2.1. uni-grams

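A unigram is a single word in a sentence, the smallest unit we count here.

For example:

sentence : `'Hello I am the Leader of the Nazis'`

The unigrams are: `Hello`, `I`, `am`, `the`, `Leader`, `of`, `the`, `Nazis`

Note that `generate_ngrams` lower-cases the text and drops stopwords, so common words such as `the` and `of` will not appear in the plots.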

In [None]:
less_toxic_unigrams = defaultdict(int)
for tweet in df['less_toxic']:
    for word in generate_ngrams(tweet, 1):
        less_toxic_unigrams[word] += 1
        
df_less_toxic_unigrams = pd.DataFrame(sorted(less_toxic_unigrams.items(), key=lambda x: x[1])[::-1])

unigrams_less_100 = df_less_toxic_unigrams[:N]

more_toxic_unigrams = defaultdict(int)
for tweet in df['more_toxic']:
    for word in generate_ngrams(tweet, 1):
        more_toxic_unigrams[word] += 1
        
df_more_toxic_unigrams = pd.DataFrame(sorted(more_toxic_unigrams.items(), key=lambda x: x[1])[::-1])

unigrams_more_100 = df_more_toxic_unigrams[:N]

In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(18, N//2), dpi=100)
plt.tight_layout()

sns.barplot(y=unigrams_less_100[0], x=unigrams_less_100[1], ax=axes[0], color='green')
sns.barplot(y=unigrams_more_100[0], x=unigrams_more_100[1], ax=axes[1], color='red')

for i in range(2):
    axes[i].spines['right'].set_visible(False)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')
    axes[i].tick_params(axis='x', labelsize=13)
    axes[i].tick_params(axis='y', labelsize=13)

axes[0].set_title(f'Top {N} most common unigrams in less_toxic comments', fontsize=15)
axes[1].set_title(f'Top {N} most common unigrams in more_toxic comments', fontsize=15)

plt.show()

### 2.2. bi-grams

A bi-gram is a pair of consecutive words: if we iterate through each word in a sentence, the word together with the word that follows it forms a bi-gram.

Let's take the previous sentence as an example:

sentence : `'Hello I am the Leader of the Nazis'`

The bi-grams are: `Hello I`, `I am`, `am the`, `the Leader`, `Leader of`, `of the`, `the Nazis`

In [None]:
less_toxic_bigrams = defaultdict(int)
for tweet in df['less_toxic']:
    for word in generate_ngrams(tweet, 2):
        less_toxic_bigrams[word] += 1
        
df_less_toxic_bigrams = pd.DataFrame(sorted(less_toxic_bigrams.items(), key=lambda x: x[1])[::-1])

bigrams_less_100 = df_less_toxic_bigrams[:N]

more_toxic_bigrams = defaultdict(int)
for tweet in df['more_toxic']:
    for word in generate_ngrams(tweet, 2):
        more_toxic_bigrams[word] += 1
        
df_more_toxic_bigrams = pd.DataFrame(sorted(more_toxic_bigrams.items(), key=lambda x: x[1])[::-1])

bigrams_more_100 = df_more_toxic_bigrams[:N]

In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(18, N//2), dpi=100)
plt.tight_layout()

sns.barplot(y=bigrams_less_100[0], x=bigrams_less_100[1], ax=axes[0], color='green')
sns.barplot(y=bigrams_more_100[0], x=bigrams_more_100[1], ax=axes[1], color='red')

for i in range(2):
    axes[i].spines['right'].set_visible(False)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')
    axes[i].tick_params(axis='x', labelsize=13)
    axes[i].tick_params(axis='y', labelsize=13)

axes[0].set_title(f'Top {N} most common bigrams in less_toxic comments', fontsize=15)
axes[1].set_title(f'Top {N} most common bigrams in more_toxic comments', fontsize=15)

plt.show()

### 2.3. tri-grams

Similarly, the tri-grams would be 3 consecutive words in a sentence. 

sentence : `'Hello I am the Leader of the Nazis'`

The tri-grams are: `Hello I am`, `I am the`, `am the Leader`, `the Leader of`, `Leader of the`, `of the Nazis`

Similarly we can go on calculating n-grams :)
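
The three example lists above can be reproduced with a few lines of plain Python. This is only a minimal sketch: `simple_ngrams` is a hypothetical helper that, unlike `generate_ngrams`, keeps stopwords and the original casing so that the output matches the lists written out above.

```python
def simple_ngrams(text, n=1):
    """Return all contiguous n-word sequences in `text` (no stopword filtering)."""
    tokens = text.split()
    # zip over n shifted copies of the token list: tokens[0:], tokens[1:], ...
    return [' '.join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]


sentence = 'Hello I am the Leader of the Nazis'
for n in (1, 2, 3):
    print(n, simple_ngrams(sentence, n))
```

A sentence with 8 words yields 8 uni-grams, 7 bi-grams and 6 tri-grams. If `n` is larger than the number of words, the result is an empty list.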

In [None]:
less_toxic_trigrams = defaultdict(int)
for tweet in df['less_toxic']:
    for word in generate_ngrams(tweet, 3):
        less_toxic_trigrams[word] += 1
        
df_less_toxic_trigrams = pd.DataFrame(sorted(less_toxic_trigrams.items(), key=lambda x: x[1])[::-1])

trigrams_less_100 = df_less_toxic_trigrams[:N]

more_toxic_trigrams = defaultdict(int)
for tweet in df['more_toxic']:
    for word in generate_ngrams(tweet, 3):
        more_toxic_trigrams[word] += 1
        
df_more_toxic_trigrams = pd.DataFrame(sorted(more_toxic_trigrams.items(), key=lambda x: x[1])[::-1])

trigrams_more_100 = df_more_toxic_trigrams[:N]

In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(30, N//2), dpi=100)
plt.tight_layout()

sns.barplot(y=trigrams_less_100[0], x=trigrams_less_100[1], ax=axes[0], color='green')
sns.barplot(y=trigrams_more_100[0], x=trigrams_more_100[1], ax=axes[1], color='red')

for i in range(2):
    axes[i].spines['right'].set_visible(False)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')
    axes[i].tick_params(axis='x', labelsize=13)
    axes[i].tick_params(axis='y', labelsize=13)

axes[0].set_title(f'Top {N} most common trigrams in less_toxic comments', fontsize=35)
axes[1].set_title(f'Top {N} most common trigrams in more_toxic comments', fontsize=35)

plt.show()

# 3. WordClouds



A WordCloud is a great way of visualizing the occurrences of the most common words. The font size of each word depends on how often that word occurs in the whole corpus.

### 3.1 `less_toxic` comments WordCloud visualization

Let's visualise the word cloud of the `less_toxic` corpus. You can also project the word cloud onto a mask and select any colormap.

In [None]:
import requests
from io import BytesIO
try:
    url="https://user-images.githubusercontent.com/74188336/142692890-641ebc21-2e47-4556-9d37-1c0b9e1a0587.jpeg"
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))

    text = ' '.join(df['less_toxic'].values)
    mask = np.array(img)
    wordcloud = WordCloud(max_font_size=50, max_words=1000, background_color="white", mask=mask, colormap='BuGn').generate(text.lower())
    plt.figure(figsize=(15,15))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
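except Exception as e:
    # Report the problem (e.g. the mask image could not be downloaded) instead of failing silently
    print(f'Could not render the less_toxic word cloud: {e}')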

### 3.2. `more_toxic` comments WordCloud visualization

Now let's visualise the `more_toxic` comments....

Oh my.... even the wordcloud was able to determine the comments were toxic... See for yourself 😱

In [None]:

try:
    text = ' '.join(df['more_toxic'].values)
    url="https://user-images.githubusercontent.com/74188336/142692894-c17240e4-1101-4591-9d10-71793e460816.jpeg"
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))

    mask = np.array(img)
    wordcloud = WordCloud(max_font_size=50, max_words=2000, background_color="white", mask=mask, contour_width=0, contour_color='grey', colormap='Reds').generate(text.lower())
    plt.figure(figsize=(15,15))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
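except Exception as e:
    # Report the problem (e.g. the mask image could not be downloaded) instead of failing silently
    print(f'Could not render the more_toxic word cloud: {e}')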

In [None]:
less_toxic_words = [len(sentence.split(' ')) for sentence in df['less_toxic'].values]
more_toxic_words = [len(sentence.split(' ')) for sentence in df['more_toxic'].values]

less_toxic_chars = [len(sentence) for sentence in df['less_toxic'].values]
more_toxic_chars = [len(sentence) for sentence in df['more_toxic'].values]

less_toxic_punct = [len([char for char in sentence if char in string.punctuation]) for sentence in df['less_toxic'].values]
more_toxic_punct = [len([char for char in sentence if char in string.punctuation]) for sentence in df['more_toxic'].values]


# 4. Box Plots

Let's draw box plots of the number of words in the `less_toxic` and `more_toxic` comments to see the IQR. This will help us determine the `max_len` we will use to pad/truncate the comments in our deep NLP model (using BERT/RoBERTa).
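
As a rough numerical counterpart to the box plot, the quartiles and the IQR can also be computed directly. The sketch below uses made-up word counts and the standard-library `statistics.quantiles`. Taking the upper whisker bound (Q3 + 1.5 × IQR) as a candidate `max_len` is only an assumed heuristic, and transformer tokenizers split words into sub-word tokens, so the real token count will usually be higher than the word count.

```python
import statistics

# Hypothetical word counts per comment (toy data, not taken from the dataset)
word_counts = [3, 5, 8, 12, 15, 20, 22, 30, 45, 60, 120, 400]

# quantiles(n=4) returns the three quartile cut points: Q1, Q2 (median), Q3
q1, median, q3 = statistics.quantiles(word_counts, n=4)
iqr = q3 - q1

# Tukey's rule: values above Q3 + 1.5 * IQR are usually drawn as outliers
upper_fence = q3 + 1.5 * iqr

# Assumed heuristic: keep everything up to the upper fence
suggested_max_len = int(upper_fence)
print(q1, median, q3, iqr, suggested_max_len)
```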

Link to the notebook :

Similarly, let's do :
* Box plot for the number of characters
* Box plot for the number of punctuations.

Punctuation plays a dual role :(

Sometimes punctuation is important and sometimes it is unnecessary..

For e.g.: `What!!!!!???!?!?!?!`

Here the punctuation emphasizes the toxicity of the sentence.

Sometimes unnecessary punctuation just lengthens the sentence, which might result in truncation of the important part...

For e.g.: `"""""""""""""""""""""What!!!>??!L{">">`
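
The two examples above can be compared with a small sketch. `punctuation_stats` is a hypothetical helper that counts punctuation the same way as the `string.punctuation` check used for the box plots. It shows that a raw count alone cannot tell emphasis from noise, since both strings consist mostly of punctuation.

```python
import string


def punctuation_stats(text):
    """Return (number of punctuation characters, share of the text they make up)."""
    n_punct = sum(ch in string.punctuation for ch in text)
    return n_punct, n_punct / max(len(text), 1)


emphatic = 'What!!!!!???!?!?!?!'
noisy = '"""""""""""""""""""""What!!!>??!L{">">'

for sample in (emphatic, noisy):
    count, share = punctuation_stats(sample)
    print(f'{count:3d} punctuation chars ({share:.0%} of the text): {sample}')
```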

### 4.1. Box plot of Number of words

In [None]:
fig = go.Figure()
fig.add_trace(go.Box(y=less_toxic_words, name = 'less_toxic',))
fig.add_trace(go.Box(y=more_toxic_words, name = 'more_toxic'))

fig.show()

### 4.2. Box Plot of Number of characters

In [None]:
fig = go.Figure()
fig.add_trace(go.Box(y=less_toxic_chars, name = 'less_toxic',))
fig.add_trace(go.Box(y=more_toxic_chars, name = 'more_toxic'))

fig.show()

### 4.3. Box Plot of Number of Punctuations

In [None]:
fig = go.Figure()
fig.add_trace(go.Box(y=less_toxic_punct, name = 'less_toxic',))
fig.add_trace(go.Box(y=more_toxic_punct, name = 'more_toxic'))

fig.show()

# 5. Distribution Plots

Let us also look at the distributions of the number of words, characters and punctuation marks, just as we did with the box plots. The box plots already give an idea of how the values are distributed, but the distribution plots make the shape clearer.

To keep plotting fast, I take a random sample of 10,000 values, which is 10% of the 100,000 comment pairs in the original dataset.

### 5.1. Distribution of the number of words

**Note: The y-axis shows an estimated probability density rather than the actual count.**
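
To see how counts turn into densities, here is a minimal sketch with a hand-rolled, normalised histogram. The `density_histogram` helper and the toy data are assumptions for illustration only. The plot itself draws a smoothed curve rather than bars, but the normalisation idea is the same: the total area is 1.

```python
def density_histogram(values, bin_width):
    """Bucket values into fixed-width bins and return {bin_start: density}."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    total = len(values)
    # density = fraction of samples in the bin divided by the bin width
    return {start: c / (total * bin_width) for start, c in sorted(counts.items())}


word_counts = [4, 7, 9, 12, 15, 18, 21, 33, 35, 58]  # toy data
density = density_histogram(word_counts, bin_width=10)

# Summing density * bin width over all bins gives the total area
area = sum(d * 10 for d in density.values())
print(density, area)
```

Because the area is fixed, the two curves can be compared even when the two samples have different sizes.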

In [None]:
less_toxic_words_plot = random.sample(less_toxic_words, 10000)
more_toxic_words_plot = random.sample(more_toxic_words, 10000)
hist_data = [less_toxic_words_plot, more_toxic_words_plot]

X = ['less_toxic', 'more_toxic']

# Create distplot without the histogram bars (density curves and rug only)
fig = ff.create_distplot(hist_data, X, show_hist=False)
fig.show()

### 5.2. Distribution of number of characters

In [None]:
import plotly.figure_factory as ff
import numpy as np
import random

less_toxic_chars_plot = random.sample(less_toxic_chars, 10000)
more_toxic_chars_plot = random.sample(more_toxic_chars, 10000)
hist_data = [less_toxic_chars_plot, more_toxic_chars_plot]

X = ['less_toxic', 'more_toxic']

# Create distplot without the histogram bars (density curves and rug only)
fig = ff.create_distplot(hist_data, X, show_hist=False)
fig.show()

### 5.3. Distribution of the number of punctuation marks

In [None]:
import plotly.figure_factory as ff
import numpy as np
import random

less_toxic_punct_plot = random.sample(less_toxic_punct, 10000)
more_toxic_punct_plot = random.sample(more_toxic_punct, 10000)
hist_data = [less_toxic_punct_plot, more_toxic_punct_plot]

X = ['less_toxic', 'more_toxic']

# Create distplot without the histogram bars (density curves and rug only)
fig = ff.create_distplot(hist_data, X, show_hist=False)
fig.show()

# 6. Trying another approach cuz I keep running out of GPU :)

I have labelled the `less_toxic` comments as `0` and the `more_toxic` comments as `1`. This is not exactly the correct process: the dataset was purposely made for training with a ranking loss function. But I thought, why not give it a try :)

It will give very close results, as both categories contain similar kinds of sentences. In this approach I have also visualised the weights of the positive (`more_toxic`) words and the negative (`less_toxic`) words. I know I have reversed it :)

Please bear with it :D

In [None]:
less_toxic_df = pd.DataFrame()
less_toxic_df['comment'] = df['less_toxic']
less_toxic_df['target'] = len(df['less_toxic'])*[0]

In [None]:
more_toxic_df = pd.DataFrame()
more_toxic_df['comment'] = df['more_toxic']
more_toxic_df['target'] = len(df['more_toxic'])*[1]

In [None]:
final_df = pd.concat([less_toxic_df, more_toxic_df])

In [None]:
final_df = final_df.sample(frac=0.1, random_state=2).reset_index(drop=True)

In [None]:
final_df.head()

# 7. TF-IDF

### 7.1. Learning about the tf-idf vectorizer.

**tf : Term Frequency**


$$tf(t,d)= \frac{\text{count of the word t in d}}{\text{total number of words in d}}$$


**df : Document Frequency**

This measures how widely a term is spread across the whole corpus. It is very similar to TF; the only difference is that TF counts the occurrences of a term t within a document d, whereas DF counts how many of the N documents in the corpus contain t. A document counts once if the term is present in it at least once; we do not need to know how many times the term appears.

$$\text{df}(t) = \text{number of documents (out of N) that contain t}$$

To keep this in a range as well, we can normalize by dividing by the total number of documents. Our main goal is to know how informative a term is, and DF is the exact inverse of that: the more documents contain a term, the less informative it is. That is why we invert the DF.


**idf : inverse document frequency**

IDF is the inverse of the document frequency, and it measures the informativeness of a term t. IDF will be very low for the most frequent words, such as stop words: they are present in almost all of the documents, so N/df is close to its minimum value of 1. This finally gives what we want, a relative weighting.

$$\text{idf}(t) = \frac{N}{\text{df}(t)}$$


There are a few other problems with this IDF. For a large corpus, say N=10000, the IDF of a rare term explodes, so to dampen the effect we take the log. We also add 1 to the denominator to avoid dividing by zero when a term appears in no document.

$$\text{idf}(t) = \log\frac{N}{\text{df}(t) + 1}$$

Finally, multiplying TF and IDF gives the TF-IDF score.

$$\text{tf-idf}(t, d) = \text{tf}(t, d)\times\log\frac{N}{\text{df}(t) + 1}$$
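
To make the formulas concrete, here is a minimal sketch that evaluates them by hand on a made-up four-sentence corpus. The helper names (`term_freq`, `doc_freq`, `tf_idf`) and the corpus are illustrative assumptions. scikit-learn's `TfidfVectorizer` uses a slightly different variant by default (raw counts for tf, a smoothed idf of $\ln\frac{1+N}{1+\text{df}(t)} + 1$, and L2 normalisation of each row), so its numbers will not match these.

```python
import math

# Toy corpus (hypothetical), already lower-cased and whitespace-tokenised
docs = [
    'you are rude',
    'you are kind',
    'you rude rude troll',
    'thanks for the kind help',
]
tokenised = [d.split() for d in docs]
n_docs = len(tokenised)


def term_freq(term, doc_tokens):
    # count of the term in the document / total number of words in the document
    return doc_tokens.count(term) / len(doc_tokens)


def doc_freq(term):
    # number of documents that contain the term at least once
    return sum(term in doc_tokens for doc_tokens in tokenised)


def tf_idf(term, doc_tokens):
    # tf(t, d) * log(N / (df(t) + 1)), as in the formula above
    return term_freq(term, doc_tokens) * math.log(n_docs / (doc_freq(term) + 1))


for term in ('you', 'rude', 'troll'):
    print(term, [round(tf_idf(term, d), 3) for d in tokenised])
```

`you` appears in three of the four sentences, so with this formula its weight collapses to zero, while the rarer `troll` gets the largest weight in the third sentence.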

I previously thought of using the count vectorizer, but TF-IDF automatically down-weights words that appear in most documents and therefore carry little information for the classification. In principle this makes stopword removal less necessary, although the vectorizer below still passes `stop_words='english'`.

In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(final_df, test_size=0.2, random_state=2)

### 7.2. Vectorizing using `TfidfVectorizer`

In [None]:
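from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Uni-, bi- and tri-gram TF-IDF features with English stopwords removed
tfidf_vec = TfidfVectorizer(stop_words='english', ngram_range=(1,3))
# Learn the vocabulary and IDF weights from the train + test comments.
# Note: this lets test-set term statistics leak into the features; fit on train only for a stricter evaluation.
tfidf_vec.fit(train_df['comment'].values.tolist() + test_df['comment'].values.tolist())
train_tfidf = tfidf_vec.transform(train_df['comment'].values.tolist())
test_tfidf = tfidf_vec.transform(test_df['comment'].values.tolist())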

# 8. Training using Logistic Regression Model

I have used a simple logistic regression model, since it is fast to train and its coefficients show which words contribute to the classification.

I have plotted the words with their weights below.

Also, here I have used `accuracy_score` as the metric, which is not at all the correct metric :( but the way I designed this problem, it does the work :D

I have printed the OOF accuracy scores, followed by the accuracy score on the whole test set.

In [None]:
from sklearn import metrics, model_selection, linear_model

train_y = train_df["target"].values

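def runModel(train_X, train_y, test_X, test_y, test_X2):
    # Fit one logistic regression and return hard 0/1 predictions for both sets
    model = linear_model.LogisticRegression(C=5., solver='sag')
    model.fit(train_X, train_y)
    pred_test_y = model.predict(test_X)    # predictions on the validation fold
    pred_test_y2 = model.predict(test_X2)  # predictions on the held-out test set
    return pred_test_y, pred_test_y2, model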

print("Building model.")
cv_scores = []
pred_full_test = 0
pred_train = np.zeros([train_df.shape[0]])
kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=2017)

all_models=[]
fold=0
for dev_index, val_index in kf.split(train_df):
    
    print('-'*50)
    print('Fold :', fold)
    print('-'*50)


    dev_X, val_X = train_tfidf[dev_index], train_tfidf[val_index]
    dev_y, val_y = train_y[dev_index], train_y[val_index]
    
    print('Training..')
    pred_val_y, pred_test_y, model = runModel(dev_X, dev_y, val_X, val_y, test_tfidf)
    
    all_models.append(model)
    
    pred_full_test = pred_full_test + pred_test_y
    pred_train[val_index] = pred_val_y
    cv_scores.append(metrics.accuracy_score(val_y, pred_val_y))
    print(f'Accuracy Score for fold {fold}:', metrics.accuracy_score(val_y, pred_val_y))
    fold+=1

In [None]:
# Majority vote: average the 0/1 predictions of the 5 folds and round to the nearest label
preds = [round(y) for y in pred_full_test/5]

In [None]:
print('Accuracy on test set:', metrics.accuracy_score(test_df.target, preds))

The accuracy on the test set isn't bad. But the results will change when we predict the probabilities :( since all the sentences are toxic........

Even the models don't like toxicity -_-

# 9. Visualisation of `more_toxic` words

The weights determine the **severity** of toxicity. The words in `red` are **less** toxic words and the words in `green` are **more** toxic words. Note that `model` here is the model from the last fold.

In [None]:
import eli5
eli5.show_weights(model, vec=tfidf_vec, top=100, feature_filter=lambda x: x != '<BIAS>')

# 10. Calculating Score on Validation Data

In [None]:
valdf = pd.read_csv('../input/jigsaw-toxic-severity-rating/validation_data.csv')

In [None]:
comment1 = valdf['less_toxic'].values
comment2 = valdf['more_toxic'].values

In [None]:
comm1 = tfidf_vec.transform(comment1)
comm2 = tfidf_vec.transform(comment2)

In [None]:
pred1 = []
pred2 = []
for model in all_models:
    pred1.append(np.array(model.predict_proba(comm1)[:,1]))
    pred2.append(np.array(model.predict_proba(comm2)[:,1]))

In [None]:
pred1 = sum(pred1)/5
pred2 = sum(pred2)/5

In [None]:
pred1

In [None]:
val_score=[]
for s1, s2 in zip(pred1, pred2):
    if s1<s2:
        val_score.append(1)
    else:
        val_score.append(0)
        
print('Validation Score :',np.mean(val_score))
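
The loop above counts a pair as correct when the `less_toxic` comment gets a strictly lower score than the `more_toxic` one. Here is a minimal sketch of the same metric on made-up scores; the `pairwise_accuracy` name and the numbers are illustrative only.

```python
def pairwise_accuracy(less_scores, more_scores):
    """Fraction of pairs where the less toxic comment scores strictly lower."""
    hits = [s1 < s2 for s1, s2 in zip(less_scores, more_scores)]
    return sum(hits) / len(hits)


less = [0.10, 0.40, 0.70, 0.55]
more = [0.30, 0.35, 0.90, 0.55]  # the last pair is a tie and counts as a miss
print(pairwise_accuracy(less, more))
```

Only the ordering inside each pair matters, so any monotonic rescaling of the scores leaves this metric unchanged.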

# 11. Maximum Score that can be obtained on `validation_data.csv`

The maximum score that can be obtained on `validation_data.csv` is calculated in the cell below.

You can find the discussion about this [here](https://www.kaggle.com/c/jigsaw-toxic-severity-rating/discussion/287350); the code was proposed by [yuval reina](https://www.kaggle.com/yuval6967).
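
The idea behind this upper bound, as I understand it, is that different workers sometimes rank the same two comments in opposite orders, and a single score per comment can agree with at most the majority for each pair. A minimal pure-Python sketch on made-up annotations follows. The comment labels `A` to `D` and the vote counts are hypothetical, and each pair is treated independently.

```python
from collections import Counter

# Hypothetical annotations: one (less_toxic, more_toxic) tuple per worker judgement
annotations = [
    ('A', 'B'), ('A', 'B'), ('B', 'A'),  # 2 of 3 workers say B is more toxic
    ('C', 'D'), ('C', 'D'), ('C', 'D'),  # unanimous
]

votes = Counter(annotations)
best, total = 0, 0
seen = set()
for (less, more), n in votes.items():
    pair = frozenset((less, more))
    if pair in seen:
        continue  # this unordered pair was already handled in the other direction
    seen.add(pair)
    n_reverse = votes.get((more, less), 0)
    best += max(n, n_reverse)  # the best we can do is side with the majority
    total += n + n_reverse

print('Maximum score:', best / total)
```

Here the best achievable score is 5 out of 6 judgements, because one worker disagrees with the majority on the first pair.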

In [None]:
gp1=valdf.groupby(['less_toxic','more_toxic']).worker.count().reset_index()
gp2=gp1.copy()
gp2['less_toxic']=gp1['more_toxic']
gp2['more_toxic']=gp1['less_toxic']
mrg=gp1.merge(gp2,how='outer',on=['less_toxic','more_toxic']).fillna(0)
mrg['sum']=mrg.worker_x+mrg.worker_y
mrg['max']=mrg[['worker_x','worker_y']].max(1)
print('Maximum Score :', mrg['max'].sum()/mrg['sum'].sum())

The validation score isn't great compared to transformers or a Naive Bayes model. But we can't exactly call it bad, because the maximum score that can be obtained on the validation set is **0.824**.

In [None]:
print('Effective accuracy:', np.round(np.mean(val_score)/0.824*100, 2),'%')

### Not bad..

With almost no pre-processing and just Logistic Regression, we get an effective accuracy of about 75%.. :D
Seems like TF-IDF works great. Let's make a submission now :)

# 12. Submission Time 🎃

In [None]:
sub = pd.read_csv('../input/jigsaw-toxic-severity-rating/comments_to_score.csv')

comms = tfidf_vec.transform(sub['text'].values)
sub_preds = []
for model in all_models:
    sub_preds.append(np.array(model.predict_proba(comms)[:,1]))

# Sum of the fold probabilities; dividing by the number of folds would not change the ranking
sub['score'] = sum(sub_preds)
sub[['comment_id', 'score']].to_csv('submission.csv', index=False)

In [None]:
sub[['comment_id', 'score']].head()

# (: <span style="color:white">Thank you for reading</span> :)

# <span style="color:white">o.O</span> Please DO <span style="color:white">UPVOTE</span> if you find it useful <span style="color:white">O.o</span>

### References:

* [Simple Exploration Notebook - QIQC](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-qiqc) 
* [Just some simple EDA](https://www.kaggle.com/tunguz/just-some-simple-eda)
* [NLP with Disaster Tweets - EDA, Cleaning and BERT](https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert)