In [2]:

import pandas as pd
import numpy as np
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

## Project 1 - NLP and Text Classification

For this project you will need to classify some angry comments into their respective category of angry. The process that you'll need to follow is (roughly):
<ol>
<li> Use NLP techniques to process the training data. 
<li> Train model(s) to predict which class(es) each comment is in.
    <ul>
    <li> A comment can belong to any number of classes, including none. 
    </ul>
<li> Generate predictions for each of the comments in the test data. 
<li> Write your test data predicitions to a CSV file, which will be scored. 
</ol>

You can use any models and NLP libraries you'd like. Think aobut the problem, look back to see if there's anything that might help, give it a try, and see if that helps. We've regularly said we have a "toolkit" of things that we can use, we generally don't know which ones we'll need, but here you have a pretty simple goal - if it makes it more accurate, it helps. There's not one specific solution here, there are lots of things that you could do. 

## Training Data

Use the training data to train your prediction model(s). Each of the classification output columns (toxic to the end) is a human label for the comment_text, assessing if it falls into that category of "rude". A comment may fall into any number of categories, or none at all. Membership in one output category is <b>independent</b> of membership in any of the other classes (think about this when you plan on how to make these predictions - it may also make it easier to split work amongst a team...). 

In [97]:
# loading training data:

train_df = pd.read_csv("train.csv.zip")
train_df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [98]:
train_df[train_df['toxic']==1].iloc[12]

id                                                003217c3eb469ba9
toxic                                                            1
severe_toxic                                                     0
obscene                                                          0
threat                                                           1
insult                                                           0
identity_hate                                                    0
Name: 79, dtype: object

In [99]:
train_df['comment_text'].iloc[12]



In [100]:
# shape of training data
train_df.shape

(159571, 8)

In [101]:
# sampling 500000 from training data
train = train_df.sample(50000)

## Test Data

In [102]:
#Loading testing data:

test_df = pd.read_csv("test.csv")
test_df.head()

Unnamed: 0,id,comment_text
0,1,Yo bitch Ja Rule is more succesful then you'll...
1,2,== From RfC == \n\n The title is fine as it is...
2,3,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,4,":If you have a look back at the source, the in..."
4,5,I don't anonymously edit articles at all.


In [25]:
# shape of testing data:
test_df.shape

(153164, 2)

In [103]:
test_df['comment_text'][2]


'" \n\n == Sources == \n\n * Zawe Ashton on Lapland —  /  "'

### Defining a preprocessing function:
The preprocess function is designed to preprocess individual text documents by removing unnecessary words, characters, and noise that may not be useful for downstream analysis.

In order to preprocess a large number of text documents in a structured way, it can be helpful to apply this function to a dataframe that contains a column of text documents. This allows you to preprocess all the documents at once using the same set of operations, which can save time and ensure consistency across the documents.

In [104]:
# Preprocess the text data
def preprocess(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text)
    words = [w for w in words if not w in stop_words]
    # Join the words back into a string
    text = " ".join(words)
    # Remove numbers and other special characters
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text

### Applying Preprocess function to training  dataset:

applies the preprocess function to each row of the 'comment_text' column in the train dataframe, and saves the preprocessed text in a new column called 'clean_comment'.
<br> The resulting train dataframe now has two columns: 'comment_text' and 'clean_comment'. The 'comment_text' column contains the raw text data, while the 'clean_comment' column contains the preprocessed text data.

In [105]:
train['clean_comment'] = train['comment_text'].apply(preprocess)

### Split the data into training and testing sets

 The training set is used to train the model, while the testing set is used to evaluate the model's performance on new, unseen data.

In [95]:
# Split the data into training and testing sets
X = train['clean_comment']
y = train[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [96]:
train_df.head(2)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,new_column_name,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0


In [29]:
# Vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(X_train).toarray()
X_test = vectorizer.transform(X_test).toarray()


In [30]:
# Train a logistic regression classifier for each toxicity label
labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

In [31]:

from sklearn.metrics import classification_report, accuracy_score

# Train a logistic regression classifier for each toxicity label
labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
for label in labels:
    y_train_label = y_train[label]
    y_test_label = y_test[label]
    clf = LogisticRegression()
    clf.fit(X_train, y_train_label)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test_label, y_pred)
    print(f"Classification report for {label}:")
    print(classification_report(y_test_label, y_pred))
    print(f"Accuracy score: {accuracy}")


Classification report for toxic:
              precision    recall  f1-score   support

           0       0.95      1.00      0.97      9063
           1       0.94      0.52      0.67       937

    accuracy                           0.95     10000
   macro avg       0.94      0.76      0.82     10000
weighted avg       0.95      0.95      0.95     10000

Accuracy score: 0.9521
Classification report for severe_toxic:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      9893
           1       0.54      0.19      0.28       107

    accuracy                           0.99     10000
   macro avg       0.77      0.59      0.64     10000
weighted avg       0.99      0.99      0.99     10000

Accuracy score: 0.9896
Classification report for obscene:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      9477
           1       0.94      0.58      0.72       523

    accuracy                

In [32]:
import pandas as pd
import numpy as np
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Load the trained classifiers
labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
classifiers = {}
for label in labels:
    clf = LogisticRegression()
    clf.fit(X_train, y_train[label])
    classifiers[label] = clf

In [55]:
test_df['clean_comment'] = test_df['comment_text'].apply(preprocess)

In [56]:
# Vectorize the text data using the trained vectorizer
vectorizer = TfidfVectorizer(max_features=5000)
X_test = vectorizer.fit_transform(test_df['clean_comment']).toarray()

In [57]:
# Predict the toxicity labels for the test data using the trained classifiers
pred_labels = {}
for label in labels:
    clf = classifiers[label]
    pred_labels[label] = clf.predict(X_test)

In [58]:
# Store the predicted labels in a file named out.csv
out_df = pd.DataFrame(pred_labels, columns=labels)
out_df.to_csv('Pre2.csv', index=False)


In [111]:
test_df[['id','comment_text']].iloc[1039]

id                                                           1040
comment_text    " \n\n == Illusions about ""standard Croatian"...
Name: 1039, dtype: object

In [60]:
out_df[out_df['toxic']==1]

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
41,1,0,1,0,1,0
391,1,0,0,0,0,0
665,1,0,0,0,0,0
1040,1,0,0,0,0,0
1390,1,0,0,0,0,0
...,...,...,...,...,...,...
152445,1,0,0,0,1,0
152511,1,0,1,0,0,0
152943,1,0,1,0,0,0
152987,1,0,1,0,0,0


In [61]:
out_df[out_df['severe_toxic']==1]

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
141860,1,1,1,0,1,0


In [82]:
test_df['comment_text'].iloc[141859]

'" \n : \n :*""I hope you die of something you ate""  \n :*""She got shot with the whore makeup gun?""  \n :*""Vani you ignorant slut""  \n :*""get some real dick in your boney ass diet""  \n :*""when I called you a \'dumb cunt\' I did not mean to imply you were a \'dumb woman\' [...] I think you are a dumb human being"" \n :*""you are a stupid female. kill yourself."" \n :*""This is just for you Food Babe you\'re an ugly twat"" (Image captioned ""What organ stays warm inside of a dead girl\'s body? My Dick"") \n :Top quality criticism from the scientific community m8...   "'

In [106]:
test_df['id'].iloc[141859]

141860

In [62]:
out_df[out_df['obscene']==1]

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
41,1,0,1,0,1,0
2908,1,0,1,0,0,0
3194,1,0,1,0,0,0
5608,1,0,1,0,0,0
5929,1,0,1,0,0,0
...,...,...,...,...,...,...
150474,1,0,1,0,1,0
151215,1,0,1,0,1,0
152511,1,0,1,0,0,0
152943,1,0,1,0,0,0


In [89]:
test_df['id'].iloc[152943]

152944

In [90]:

test_df['comment_text'].iloc[152943]

'" \n\n ::::::Fixed I think. Replaced ""entered airspace"" with ""attacked objectives"".  "'

In [63]:
out_df[out_df['threat']==1]

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate


In [64]:
out_df[out_df['insult']==1]

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
41,1,0,1,0,1,0
2946,1,0,0,0,1,0
13596,1,0,1,0,1,0
16162,1,0,1,0,1,0
16442,1,0,1,0,1,0
...,...,...,...,...,...,...
147611,1,0,0,0,1,0
148789,1,0,1,0,1,0
150474,1,0,1,0,1,0
151215,1,0,1,0,1,0


In [65]:
out_df[out_df['identity_hate']==1]

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
44188,1,0,0,0,1,1
59399,1,0,0,0,0,1


In [84]:
test_df['comment_text'].iloc[44188]

'" \n\n Goddamn it!  "'

In [85]:
test_df['comment_text'].iloc[59399]

'Stop with the vandalism, goddamn Olavete.'

## Output Details, Submission Info, and Example Submission

For this project, please output your predictions in a CSV file. The structure of the CSV file should match the structure of the example below. 

The output should contain one row for each row of test data, complete with the columns for ID and each classification.

Into Moodle please submit:
<ul>
<li> Your notebook file(s). I'm not going to run them, just look. 
<li> Your sample submission CSV. This will be evaluated for accuracy against the real labels; only a subset of the predictions will be scored. 
</ul>

It is REALLY, REALLY, REALLY important the the structure of your output matches the specifications. The accuracies will be calculated by a script, and it is expecting a specific format. 

### Sample Evaluator

The file prediction_evaluator.ipynb contains an example scoring function, scoreChecker. This function takes a sumbission and an answer key, loops through, and evaluates the accuracy. You can use this to verify the format of your submission. I'm going to use the same function to evaluate the accuracy of your submission, against the answer key (unless I made some mistake in this counting function).

In [None]:
#Construct dummy data for a sample output. 
#You won't do this part first, you have real data - I'm faking it. 
#Your data should have the same structure, so the CSV output is the same
dummy_ids = ["dfasdf234", "asdfgw43r52", "asdgtawe4", "wqtr215432"]
dummy_toxic = [0,0,0,0]
dummy_severe = [0,0,0,0]
dummy_obscene = [0,1,1,0]
dummy_threat = [0,1,0,1]
dummy_insult = [0,0,1,0]
dummy_ident = [0,1,1,0]
columns = ["id", "toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
sample_out = pd.DataFrame( list(zip(dummy_ids, dummy_toxic, dummy_severe, dummy_obscene, dummy_threat, dummy_insult, dummy_ident)),
                    columns=columns)
sample_out.head()

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,dfasdf234,0,0,0,0,0,0
1,asdfgw43r52,0,0,1,1,0,1
2,asdgtawe4,0,0,1,0,1,1
3,wqtr215432,0,0,0,1,0,0


In [None]:
#Write DF to CSV. Please keep the "out.csv" filename. Moodle will auto-preface it with an identifier when I download it. 
#This command should work with your dataframe of predictions. 
sample_out.to_csv('out.csv', index=False)  

## Grading

The grading for this is split between accuracy and well written code:
<ul>
<li> 75% - Accuracy. The most accurate will get 100% on this, the others will be scaled down from there. 
<li> 25% - Code quality. Can the code be followed and made sense of - i.e. comments, sections, titles. 
</ul>