# Toxic Code Review Model

We will use a dataset of code review comments that are labeled as 'toxic' or 'non-toxic'. The labels are binary: 'toxic' is 1 and 'non-toxic' is 0. The dataset and paper can be found here:

https://github.com/WSU-SEAL/ToxiCR

We will need to:
1. Pre-process the data.
2. Train a model on part of the dataset.
3. Evaluate the model on the rest of the dataset.
4. (Next) Try the model on other datasets, or new code reviews on GitHub.
5. (Next) Try different types of models and compare them.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Load and examine the data.

Goal: In this section, we will load the data from the Excel sheet and compute a few basic statistics on the dataset.

1. Download the Excel dataset at [code-review-dataset-full.xlsx](https://github.com/WSU-SEAL/ToxiCR/blob/master/models/code-review-dataset-full.xlsx).

2. Upload the dataset to this Colab. Click the file icon on the left side of this page, then press the file upload button and select the downloaded file.

3. Load the dataset into the runtime with code. Here we will use [pandas](https://pandas.pydata.org/docs/user_guide/index.html) again.

In [None]:
import pandas as pd
import numpy as np
import re
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

In pandas, we work with objects called [DataFrames](https://pandas.pydata.org/docs/user_guide/dsintro.html#dataframee). These are two-dimensional (rows and columns, like matrices) structures that hold data. Load the data into a pandas DataFrame (hint: use [reading excel files](https://pandas.pydata.org/docs/user_guide/io.html#reading-excel-files)).

In [None]:
# Write a function that returns a DataFrame from the Excel file:
def get_dataframe_from_excel(file):
  # Read the file that was passed in, rather than a hard-coded name
  review_data_file = pd.read_excel(file)

  return review_data_file

# Note: you may need to update the file name here
toxic_df = get_dataframe_from_excel('code-review-dataset-full.xlsx')
toxic_df

Unnamed: 0,message,is_toxic
0,This and below assignments also should be removed,0
1,this should be flavor_id = self.flavor_id,0
2,bool session_adopted_ = false;,0
3,"nit: Starting C++11, this could be done direct...",0
4,I am confused.\n \n This is the tar process we...,0
...,...,...
19646,Amazing!!! I bet that contributed a lot to our...,0
19647,great catch,0
19648,"Wow, this is amazing.",0
19649,This is awesome.,0


### Basic statistics

Next, let's determine what percentage of the dataset contains toxic comments.

1. Find the total number of comments in the dataset. (Hint: this corresponds to the number of rows in the DataFrame. You can find a list of helpful DataFrame functions [here](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html).)

2. Find how many toxic comments there are. There are many ways to do this. Since toxic comments have 1's in their 'is_toxic' column, one way is to sum up the 1s in that column. To get an array of all the values in the DataFrame, use [DataFrame.values](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.values.html#pandas-dataframe-values).
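
As a hedged illustration of the two steps above, the sketch below builds a small DataFrame with the same `message` and `is_toxic` column names and counts the toxic rows in two equivalent ways. The rows and the name `toy_df` are invented for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy data with the same column names as the real dataset (values are invented).
toy_df = pd.DataFrame({
    "message": ["nit: rename this", "this is garbage", "great catch", "looks good"],
    "is_toxic": [0, 1, 0, 0],
})

# Step 1: the number of comments is the number of rows.
total = len(toy_df)

# Step 2a: the labels are 0/1, so summing the column counts the 1s.
toxic = int(toy_df["is_toxic"].sum())

# Step 2b: the same count by iterating over DataFrame.values,
# where each row is [message, is_toxic].
toxic_via_values = int(sum(row[1] for row in toy_df.values))

print(f"{100 * toxic / total:.2f}% of the toy data is toxic")
```

Summing the column directly is usually the shorter option; iterating over `DataFrame.values` is shown only because the hint mentions it.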



In [None]:
# Let's examine some basic statistics.

# Return the total number of comments in the dataset.
def total_comments(dataframe):
  return len(dataframe)

# Return how many toxic comments there are in the dataset.
#   Toxic comments have is_toxic == 1, so count the rows with a positive label.
def toxic_comments(dataframe):
  return np.sum(dataframe['is_toxic'] > 0)

# Calculate the percentage of toxic comments in the database.
print('{:02.4}% of the database is toxic comments'.format(100 * toxic_comments(toxic_df) / total_comments(toxic_df)))

19.12% of the database is toxic comments


Let's pick out some toxic comments and read them to see what they're like.

Filter the DataFrame for toxic comments and print out the first 6:

In [None]:
# Function to return the first 20 toxic comments
def first_toxic_comments(dataframe):
    # Filter the dataframe to get only the toxic comments
    toxic_comments = dataframe[dataframe['is_toxic'] == 1]
    # Select the first 20 toxic comments
    first_20_toxic_comments = toxic_comments['message'].head(20)
    return first_20_toxic_comments

# Example DataFrame for testing
# toxic_df = pd.DataFrame({'message': [...], 'is_toxic': [...]})

comments = first_toxic_comments(toxic_df)
for i in range(6):
    print('{0}. {1}'.format(i, comments.iloc[i]))


0. Yuck. Use %.70s, which will do this more gracefully
1. function keyword is a bashism and IMHO looks ugly compared to 'usage() {'.
2. For this, you'd just check the whether the size of |entries| is === 1. Another option coul dbe to move the checks to the caller, but that might be more clumsy.
3. In the future, we should look into seeing if we make this a little cleaner instead of a crap load of info() lines.
4. the name is ugly, what can't it be openstack-ec2-api-service?
5. lazy-lazy =)


What are some things you notice about the comments?

If you wanted to clean the data, what things would you do to make sure the data was uniform and easy for the model to analyze?

## Preprocess the dataset

Goal: In this section, we will pre-process the code review text so that the inputs to the model are more uniform and the model has a better chance of finding common patterns and context.

In the previous section, you saw some examples of the code review comments. We may want to:
1. Lowercase everything
2. Remove punctuation, extra spaces, and tabs.
3. Expand contractions ("isn't" to "is not").
4. "Normalize" words so that things like "working" and "work" are the same.
5. Remove words with little meaning like "the" and "as".
6. Convert all numbers to the same thing (like a word "num").

After this, we will also need some way to represent each list of normalized words in the code review as numerical data so that the machine learning model can understand it. There are many choices for how to do this (this is called "Feature Extraction") and we will focus on one method.

We will group this work into three tasks:
1. Cleaning.
2. Word Stemming.
3. Feature Extraction.


### Data cleaning

In this section, we will cover some of the tasks above.
* Lowercasing
* Remove punctuation and whitespace
* Convert all numbers to the word "num".
* Remove stopwords (e.g. "the", "and").

Hints:
* The Python string library provides a list of punctuation characters, [string.punctuation](https://docs.python.org/3/library/string.html#string.punctuation).
* You can match all numbers by using [regular expressions](https://www.regular-expressions.info/quickstart.html).
* The [nltk](https://www.nltk.org/index.html) package contains a list of stopwords by language.
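
To make the hints concrete, here is a minimal sketch of the cleaning steps applied to a single string, using only the standard library. The function name `clean_message` and the tiny `TOY_STOPWORDS` set are hypothetical stand-ins for illustration; a real run would use a full stopword list such as the one nltk provides.

```python
import re
import string

# Tiny illustrative stopword set (an assumption, not a real stopword list).
TOY_STOPWORDS = {"the", "and", "a", "is", "this", "to"}

# Translation table that maps every punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_message(message: str) -> str:
    """Lowercase, strip punctuation, map numbers to 'num', drop stopwords."""
    message = message.lower()
    message = message.translate(PUNCT_TO_SPACE)
    # Replace each run of digits with the placeholder word "num".
    message = re.sub(r"\d+", "num", message)
    # split() with no arguments also collapses repeated whitespace.
    words = [word for word in message.split() if word not in TOY_STOPWORDS]
    return " ".join(words)

print(clean_message("Fails on line 42, and the fix took 3 tries!"))
# prints: fails on line num fix took num tries
```

Note that `string.punctuation` includes the underscore, so identifiers such as `flavor_id` are split into two words by this approach.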

In [None]:
# Clean Data

# Convert 'message' values to strings first, so that .str methods work on every row
toxic_df['message'] = toxic_df['message'].astype(str)

# Make it lowercase
toxic_df['message'] = toxic_df['message'].str.lower()

# Expand contractions
contractions_dict = {
    "can't": "cannot",
    "won't": "will not",
    "n't": " not",
    "'re": " are",
    "'s": " is",
    "'d": " would",
    "'ll": " will",
    "'t": " not",
    "'ve": " have",
    "'m": " am"
}

contractions_re = re.compile('(%s)'%'|'.join(contractions_dict.keys()))

def expand_contractions(s, contractions_dict=contractions_dict):
  def replace(match):
    return contractions_dict[match.group(0)]
  return contractions_re.sub(replace, s)

# Expand contractions
toxic_df['message'] = toxic_df['message'].apply(lambda x: expand_contractions(x, contractions_dict))
print(toxic_df)

# Remove punctuation (replace every non-alphanumeric character with a space)
toxic_df['message'] = toxic_df['message'].apply(lambda x: re.sub(r'[^A-Za-z0-9]', ' ', x))

# Collapse each run of whitespace into a single space and trim the ends
toxic_df['message'] = toxic_df['message'].apply(lambda x: re.sub(r'\s+', ' ', x).strip())


# Stem the words
# Initialize the SnowballStemmer
sb = SnowballStemmer('english')

# Function to stem the words in a message
def stem_message(message):
    words = message.split()  # Split message into words
    stemmed_words = [sb.stem(word) for word in words]  # Stem each word
    return ' '.join(stemmed_words)  # Join stemmed words back into a single string

# Apply the stemming
toxic_df['message'] = toxic_df['message'].apply(stem_message)
# You can check with the following code, but it will be a bunch of words
#print(toxic_df['message'])

# Remove stopwords (e.g. "the", "and")
# Use scikit-learn's built-in English stopword list.
# Note: this runs after stemming, so a stopword whose stem differs from its
# listed form will not be removed.
stop = text.ENGLISH_STOP_WORDS

# Exclude the stopwords from the dataset
toxic_df['message'] = toxic_df['message'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))


# Convert all numbers to the word "num".
# Function to replace each run of digits with the word "num"
# (the parameter is named 'message' so it does not shadow the imported 'text' module)
def replace_numbers(message):
    return re.sub(r'\d+', 'num', message)

# Replace numbers with "num"
toxic_df['message'] = toxic_df['message'].apply(replace_numbers)


                                                 message  is_toxic
0      this and below assignments also should be removed         0
1              this should be flavor_id = self.flavor_id         0
2                         bool session_adopted_ = false;         0
3      nit: starting c++11, this could be done direct...         0
4      i am confused.\n \n this is the tar process we...         0
...                                                  ...       ...
19646  amazing!!! i bet that contributed a lot to our...         0
19647                                        great catch         0
19648                             wow, this is amazing.          0
19649                                   this is awesome.         0
19650  cannot wait for this! seeing the nuances there...         0

[19651 rows x 2 columns]


In [None]:
class FeatureExtractor:
    def __init__(self):
        # The fitted TF-IDF vectorizer; None until extract() is first called
        self.model = None

    def extract(self, stemmed_messages):
        # First call: create a TF-IDF model, fit it, and return the features
        if self.model is None:
            tfidf = TfidfVectorizer()  # Create object
            result = tfidf.fit_transform(stemmed_messages)  # Get TF-IDF values

            # Save the fitted model into the feature extractor
            self.model = tfidf
            return result

        # Later calls: extract features from the messages with the fitted model
        extracted = self.model.transform(stemmed_messages)
        return extracted

In [None]:
# Get the stemmed messages from the toxic_df
stemmed_messages = toxic_df['message']

# Create the feature extractor and extract
feature_extractor = FeatureExtractor()
feature_vectors = feature_extractor.extract(stemmed_messages)
print(feature_vectors)


  (0, 8438)	0.45761601691523784
  (0, 571)	0.8891499204648408
  (1, 9003)	0.24273815730946038
  (1, 4528)	0.5032997501018988
  (1, 3452)	0.8293175197313595
  (2, 3279)	0.36923245911686126
  (2, 144)	0.6046753782950489
  (2, 9069)	0.5192926350015039
  (2, 1015)	0.47788098649292343
  (3, 10296)	0.4229515539093423
  (3, 2252)	0.44245992588223965
  (3, 2479)	0.3670971798703047
  (3, 6893)	0.21041683246905823
  (3, 9641)	0.33617520601322637
  (3, 6710)	0.270058227916316
  (3, 3279)	0.311973539554391
  (3, 1015)	0.4037733388839359
  (4, 9831)	0.28799543170788383
  (4, 3411)	0.28657051389284244
  (4, 1665)	0.15673743534323625
  (4, 2224)	0.24556050590890605
  (4, 259)	0.36662802661700356
  (4, 5339)	0.15810886362325724
  (4, 6602)	0.12920040373092392
  (4, 11278)	0.15057895619212588
  :	:
  (19643, 5824)	0.5319857492223976
  (19643, 4082)	0.5319857492223976
  (19643, 3249)	0.5638279922379742
  (19643, 3428)	0.34070562134702675
  (19644, 3288)	0.7803398243316005
  (19644, 7386)	0.6253557056285
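
Each line of the output above reads `(message index, vocabulary index)` followed by the TF-IDF weight of that word in that message. To show where such a weight comes from, here is a minimal pure-Python sketch. It assumes raw term counts, the smoothed formula idf = ln((1 + n) / (1 + df)) + 1, and L2 normalization per message, which are scikit-learn's documented defaults. It splits on whitespace only, whereas `TfidfVectorizer`'s default tokenizer also drops one-character tokens. The function name `tfidf` and the three toy messages are invented for illustration.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed inverse document frequency.
        raw = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Scale the row to unit Euclidean length (L2 normalization).
        norm = math.sqrt(sum(weight * weight for weight in raw.values()))
        rows.append({term: weight / norm for term, weight in raw.items()} if norm else {})
    return rows

weights = tfidf(["great catch", "ugly code", "great code review"])
print(weights[0])
```

In the first toy message, "catch" appears in only one message while "great" appears in two, so "catch" receives the larger weight. Rarer words carry more signal under this scheme.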

## Training


In [None]:
from sklearn.svm import SVC

# Train the SVC model.
# Get the is_toxic column (the labels)
is_toxic = toxic_df['is_toxic']
model = SVC(kernel='linear', probability=True).fit(feature_vectors, is_toxic)
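
The cell above trains on every row, whereas the plan at the top of this notebook is to train on part of the dataset and evaluate on the rest. A minimal sketch of that hold-out workflow follows. It uses invented toy messages rather than the real dataset, and the 30% test size, the `random_state`, and all variable names are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Invented toy data; in the notebook these would be the cleaned messages and labels.
messages = [
    "this code is garbage", "what a stupid idea", "ugly hack terrible naming",
    "this is dumb", "lazy and sloppy work",
    "looks good to me", "please add a test", "nit rename this variable",
    "great catch thanks", "can we simplify this loop",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Hold out 30% of the rows, keeping the toxic/non-toxic ratio similar in both parts.
train_msgs, test_msgs, train_labels, test_labels = train_test_split(
    messages, labels, test_size=0.3, stratify=labels, random_state=0
)

# Fit the vectorizer on the training messages only, then reuse it for the test messages.
vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(train_msgs)
test_vectors = vectorizer.transform(test_msgs)

held_out_model = SVC(kernel="linear").fit(train_vectors, train_labels)
predictions = held_out_model.predict(test_vectors)
print(classification_report(test_labels, predictions, zero_division=0))
```

Fitting the vectorizer on the training part only keeps information about the held-out messages from leaking into the features.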

In [None]:
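def predict(message, model, feature_extractor):
  # Extract features with the already-fitted TF-IDF model.
  # Note: new messages should be cleaned the same way as the training data first.
  feature_message = feature_extractor.extract(message)

  # Predict the label (1 = toxic, 0 = non-toxic) and the class probabilities
  pred = model.predict(feature_message)
  score = model.predict_proba(feature_message)
  return pred, score

# A test message; prints the predicted label and the probability of each class
message = ["this is toxic code"]
print(predict(message, model, feature_extractor))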

(array([0]), array([[0.92554887, 0.07445113]]))
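
In the output above, the first array is the predicted label and the second holds one probability per class, in the order given by the fitted model's `classes_` attribute. For 0/1 labels, column 0 is 'non-toxic' and column 1 is 'toxic'. The small sketch below pairs the printed numbers with their labels; the helper name `label_probabilities` is hypothetical, and `[0, 1]` stands in for `model.classes_`.

```python
def label_probabilities(classes, probabilities):
    """Pair each class label with its predicted probability for one message."""
    return {int(label): float(p) for label, p in zip(classes, probabilities)}

# Probabilities copied from the output above; [0, 1] stands in for model.classes_.
scores = label_probabilities([0, 1], [0.92554887, 0.07445113])
print(scores)
print("P(toxic) =", scores[1])
```

Read this way, the model assigns the message about a 7% probability of being toxic. scikit-learn documents that `SVC` probabilities come from a separate calibration step, so they may occasionally disagree with the label returned by `predict`.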
