# Springboard--DSC Program

# Capstone Project 1 - Data Wrangling 
### by Ellen A. Savoye

The data comes from the Kaggle competition Jigsaw Unintended Bias in Toxicity Classification and was downloaded via the Kaggle API. Of the 7 files in the zipped data, we will focus on the 'train' data, which comprises 45 columns containing toxicity and identity labels, comments, and metadata.

### Import packages and data

In [1]:
# !pip install kaggle
# !pip install spacy
# !pip install spacymoji
# !pip install emot
# !pip install demoji
# !pip install swifter
# !pip install tqdm

In [2]:
import pandas as pd
import numpy as np

# libraries for NLP
import spacy

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

import re
import unicodedata

import string
from string import punctuation

from emot.emo_unicode import UNICODE_EMO, EMOTICONS
import swifter

# libraries for getting and moving data
import os
from os import path
import shutil
from zipfile import ZipFile

# from kaggle.api.kaggle_api_extended import KaggleApi

In [3]:
# necessary dependencies for text pre-processing

# the tagger, parser, and entity recognizer are loaded by default
nlp = spacy.load('en_core_web_sm')
#nlp_vec = spacy.load('en_vecs', parse = True, tag=True, #entity=True)
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')

In [4]:
# Set directories
# src = "C:\\Users\\ellen\\Documents\\GitHub\\Data_Science_Career_Track\\Capstone_1\\Code\\"
# dst = "C:\\Users\\ellen\\Documents\\GitHub\\Data_Science_Career_Track\\Capstone_1\\Data\\"

# Work computer
src = "C:\\Users\\esavoye\\Documents\\GitHub\\Data_Science_Career_Track\\Capstone_1\\Code\\"
dst = "C:\\Users\\esavoye\\Documents\\GitHub\\Data_Science_Career_Track\\Capstone_1\\Data\\"

kaggle_comp_name = 'jigsaw-unintended-bias-in-toxicity-classification'
zipfile_name = kaggle_comp_name + '.zip'

csv_file = [i for i in os.listdir(dst) if i.startswith("train") and path.isfile(path.join(dst, i))]
zip_file = [i for i in os.listdir(dst) if i.startswith("jigsaw") and path.isfile(path.join(dst, i))]

if zipfile_name not in zip_file:
    # Import data from the Kaggle API; imported here so the notebook still
    # runs without the kaggle package when the data is already present
    from kaggle.api.kaggle_api_extended import KaggleApi
    api = KaggleApi()
    api.authenticate()
    api.competition_download_files(kaggle_comp_name)
    
    # Move the jigsaw zip file to the Data folder
    files = [i for i in os.listdir(src) if i.startswith("jigsaw") and path.isfile(path.join(src, i))]
    for f in files:
        shutil.move(path.join(src, f), dst)
    
    # Extract the train data if it is not already present
    if 'train.csv' not in csv_file:
        with ZipFile(path.join(dst, zipfile_name), 'r') as zipObj:
            # List the archive contents, then extract train.csv into the Data folder
            print(zipObj.namelist())
            zipObj.extract('train.csv', path=dst)
else:
    print('Data is already downloaded')

Data is already downloaded


In [5]:
# Read in the train dataset
csv_filename = 'train.csv'
train_data = pd.read_csv(dst + csv_filename, low_memory=False)

# Output the number of rows
print("Total rows: {0}".format(len(train_data)))

# See which headers are available
print(list(train_data))

Total rows: 1804874
['id', 'target', 'comment_text', 'severe_toxicity', 'obscene', 'identity_attack', 'insult', 'threat', 'asian', 'atheist', 'bisexual', 'black', 'buddhist', 'christian', 'female', 'heterosexual', 'hindu', 'homosexual_gay_or_lesbian', 'intellectual_or_learning_disability', 'jewish', 'latino', 'male', 'muslim', 'other_disability', 'other_gender', 'other_race_or_ethnicity', 'other_religion', 'other_sexual_orientation', 'physical_disability', 'psychiatric_or_mental_illness', 'transgender', 'white', 'created_date', 'publication_id', 'parent_id', 'article_id', 'rating', 'funny', 'wow', 'sad', 'likes', 'disagree', 'sexual_explicit', 'identity_annotator_count', 'toxicity_annotator_count']


The metadata from the Civil Comments platform is contained in the following columns: created_date, publication_id, parent_id, article_id, rating, funny, wow, sad, likes, and disagree.

I will be keeping these fields in for further exploration and possible use.

In [6]:
#function for counting null records:
def num_missing(x):
    return sum(x.isnull())

#Applying per column:
print("Missing values per column:")
print(train_data.apply(num_missing, axis=0))

Missing values per column:
id                                           0
target                                       0
comment_text                                 0
severe_toxicity                              0
obscene                                      0
identity_attack                              0
insult                                       0
threat                                       0
asian                                  1399744
atheist                                1399744
bisexual                               1399744
black                                  1399744
buddhist                               1399744
christian                              1399744
female                                 1399744
heterosexual                           1399744
hindu                                  1399744
homosexual_gay_or_lesbian              1399744
intellectual_or_learning_disability    1399744
jewish                                 1399744
latino                           

In [7]:
# Find percent of missing values for each column instead of number of records

percent_missing = train_data.isnull().sum() * 100 / len(train_data)
print(round(percent_missing,1))

id                                      0.0
target                                  0.0
comment_text                            0.0
severe_toxicity                         0.0
obscene                                 0.0
identity_attack                         0.0
insult                                  0.0
threat                                  0.0
asian                                  77.6
atheist                                77.6
bisexual                               77.6
black                                  77.6
buddhist                               77.6
christian                              77.6
female                                 77.6
heterosexual                           77.6
hindu                                  77.6
homosexual_gay_or_lesbian              77.6
intellectual_or_learning_disability    77.6
jewish                                 77.6
latino                                 77.6
male                                   77.6
muslim                          

In [8]:
# Data type of each column

train_data.dtypes

id                                       int64
target                                 float64
comment_text                            object
severe_toxicity                        float64
obscene                                float64
identity_attack                        float64
insult                                 float64
threat                                 float64
asian                                  float64
atheist                                float64
bisexual                               float64
black                                  float64
buddhist                               float64
christian                              float64
female                                 float64
heterosexual                           float64
hindu                                  float64
homosexual_gay_or_lesbian              float64
intellectual_or_learning_disability    float64
jewish                                 float64
latino                                 float64
male         

In [9]:
train_data.rating.unique()

array(['rejected', 'approved'], dtype=object)

In [10]:
train_data.shape

(1804874, 45)

There are 45 columns in the train dataframe. Of those 45, only three columns, 'comment_text', 'created_date', and 'rating', have the object dtype. The remaining 42 are either float64 or int64 and are non-categorical. 'comment_text' holds the free-text comments that we need to analyze. 'created_date' holds the date each comment was originally created. 'rating' is categorical with two values: rejected or approved.

In [11]:
train_data.comment_text.head(5)

0    This is so cool. It's like, 'would you want yo...
1    Thank you!! This would make my life a lot less...
2    This is such an urgent design problem; kudos t...
3    Is this something I'll be able to install on m...
4                 haha you guys are a bunch of losers.
Name: comment_text, dtype: object

In [12]:
# View unique records in a particular column

#sorted(train_data.target.unique())
train_data.target.unique()

array([0.        , 0.89361702, 0.66666667, ..., 0.87726476, 0.01116838,
       0.87008821])

In [13]:
# Checking for blank string records
np.where(train_data.applymap(lambda x: x == ''))

(array([], dtype=int64), array([], dtype=int64))

In [14]:
# Checking the range of numerical columns
print('Minimum value: ')
train_data.iloc[:,:].min()

Minimum value: 


id                                                                                 59848
target                                                                                 0
comment_text                           Canada is north of the USA border,  its colde...
severe_toxicity                                                                        0
obscene                                                                                0
identity_attack                                                                        0
insult                                                                                 0
threat                                                                                 0
asian                                                                                  0
atheist                                                                                0
bisexual                                                                               0
black                

In [15]:
print('Maximum value: ')
train_data.iloc[:,:].max()

Maximum value: 


id                                                           6334010
target                                                             1
comment_text                                         🤣gotta love it!
severe_toxicity                                                    1
obscene                                                            1
identity_attack                                                    1
insult                                                             1
threat                                                             1
asian                                                              1
atheist                                                            1
bisexual                                                           1
black                                                              1
buddhist                                                           1
christian                                                          1
female                            

Toxicity and identity labels range from 0.0-1.0. The value represents the fraction of raters who believed the label fit the comment. Toxicity labels do not have any missing values. According to the competition details, a subset of comments have been labeled with a variety of identity attributes that have been mentioned in the comment. As such, every identity label is missing ~78% of the values per column. The subset comprises approximately 22% of the data. 

Two examples of how labeling works are as follows:
Example 1: 
    - Comment: I'm a white woman in my late 60's and believe me, they are not too crazy about me either!!
    - Toxicity Labels: All 0.0
    - Identity Mention Labels: female: 1.0, white: 1.0 (all others 0.0)

Example 2: 
    - Comment: Continue to stand strong LGBT community. Yes, indeed, you'll overcome and you have.
    - Toxicity Labels: All 0.0
    - Identity Mention Labels: homosexual_gay_or_lesbian: 0.8, bisexual: 0.6, transgender: 0.3 (all others 0.0)
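
Because each label is the fraction of raters who chose it, a value can be read back as a ratio of votes. The sketch below is illustrative only: it uses the standard-library `fractions` module to find the simplest ratio consistent with two of the `target` values printed earlier. The helper name `as_vote_ratio` and the cap of 100 raters are assumptions for the example, not properties of the dataset.

```python
from fractions import Fraction

def as_vote_ratio(label, max_raters=100):
    """Return the simplest fraction close to `label` with at most `max_raters` raters.

    `max_raters` is an assumed cap for illustration only.
    """
    return Fraction(label).limit_denominator(max_raters)

for value in (0.89361702, 0.66666667):
    ratio = as_vote_ratio(value)
    print(f"{value} is consistent with {ratio.numerator} of {ratio.denominator} raters")
```

The fraction is reduced, so it only gives the smallest rater count that fits; the actual number of raters is recorded in 'toxicity_annotator_count' and 'identity_annotator_count'.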


'Target' is the toxicity label. 'Severe_toxicity', 'obscene', 'identity_attack', 'insult', 'threat', and 'sexual_explicit' are toxicity sub types. All toxicity labels can be converted to categorical variables by using >= 0.5 as a positive indicator (1). 

The same conversion can be applied to the identity columns; it does not apply to 'id', 'comment_text', 'identity_annotator_count', or 'toxicity_annotator_count'. 'id' is a unique identifier for each comment and may not be worth keeping in the data frame. 'identity_annotator_count' and 'toxicity_annotator_count' are metadata columns from Jigsaw and may not be worth keeping either. However, I'm not removing them until exploratory data analysis shows whether they offer valuable insights.
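
A minimal sketch of that thresholding on a toy frame (the values are made up, and `label_cols` is a hypothetical shortlist of columns). It assumes missing identity labels should stay missing rather than become 0, which a plain `>= 0.5` comparison would not do:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data; the values are made up for illustration
toy = pd.DataFrame({
    "target": [0.0, 0.5, 0.893617],
    "insult": [0.1, 0.7, 0.4],
    "female": [np.nan, 1.0, 0.2],  # identity labels are missing for most rows
})

label_cols = ["target", "insult", "female"]  # hypothetical shortlist
threshold = 0.5

# Compare against the threshold, then restore NaN where no label was recorded
binary = toy[label_cols].ge(threshold).astype(float).where(toy[label_cols].notna())
print(binary)
```

Keeping the result as float leaves room for NaN; casting to an integer dtype would only be safe after deciding how to treat the unlabeled rows.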

# Text Pre-processing and Cleaning

Given the size of the original dataset, ~1.8M records, I took a 10% random sample to quantify and test some of the text cleaning and pre-processing ideas: testing for punctuation, emoticons, emoji, accented words, etc. 

In [16]:
# Creating a smaller 10% random sample data frame to test text cleaning/pre-processing

subset_train_data = train_data.sample(frac=0.10)

In [17]:
subset_train_data.shape

(180487, 45)

Prior to considering removing emojis and emoticons, I need to see how frequently they occur in my data.

In [18]:
#extracting the emojis
emojis_list = map(lambda x: ''.join(x.split()), UNICODE_EMO.keys())
r = re.compile('|'.join(re.escape(p) for p in emojis_list))
subset_train_data['emoji'] = subset_train_data['comment_text'].str.findall(r)
#Emoji count
subset_train_data['emoji_count'] = subset_train_data['emoji'].swifter.apply(len)


In [19]:
print('Emoji percentage in sample data: ', round(subset_train_data['emoji_count'].sum() / subset_train_data['emoji_count'].count() * 100,4), '%')

Emoji percentage in sample data:  0.4117 %


In [20]:
#extracting the emoticons
emoticons_list = map(lambda x: ''.join(x.split()), EMOTICONS.keys())
r = re.compile('|'.join(re.escape(p) for p in emoticons_list))
subset_train_data['emoticons'] = subset_train_data['comment_text'].str.findall(r)
#Emoticon count
subset_train_data['emoticons_count'] = subset_train_data['emoticons'].swifter.apply(len)


In [21]:
print('Emoticons percentage in sample data: ', round(subset_train_data['emoticons_count'].sum() / subset_train_data['emoticons_count'].count() * 100,4), '%')

Emoticons percentage in sample data:  4.0097 %


The figures above divide the number of matches by the number of comments, so my sample contains roughly 4 emoticons and 0.4 emoji per 100 comments. Given the minimal presence of both, I'll remove them along with punctuation before moving on.
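
Because one comment can contain several matches, occurrences per 100 comments is an upper bound on the share of comments that contain at least one. A minimal sketch of the two measures on made-up comments; `EMOJI_RE` is a simplified, assumed pattern covering only one emoji block, not the `emot` dictionary used above:

```python
import re

# Assumed, simplified pattern: only the U+1F600-U+1F64F block
EMOJI_RE = re.compile("[\U0001F600-\U0001F64F]")

# Made-up comments for illustration
comments = [
    "great \U0001F600\U0001F600\U0001F600",
    "no emoji here",
    "fine",
    "ok \U0001F600",
]

counts = [len(EMOJI_RE.findall(c)) for c in comments]

# Occurrences per 100 comments (the measure printed in the cells above)
per_100_comments = 100 * sum(counts) / len(comments)
# Percentage of comments containing at least one match
share_with_any = 100 * sum(1 for n in counts if n > 0) / len(comments)

print(per_100_comments, share_with_any)
```

In the notebook, the second measure could be computed from the existing count columns by testing for values greater than zero before averaging.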

In [22]:
# Functions to remove emoji and emoticons
def remove_emoji(text):
    # parameter named `text` to avoid shadowing the imported `string` module
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)   

def remove_emoticons(text):
    emoticon_pattern = re.compile(u'(' + u'|'.join(k for k in EMOTICONS) + u')')
    return emoticon_pattern.sub(r'', text)

In [23]:
subset_train_data['comment_text_no_emo'] = subset_train_data['comment_text'].swifter.apply(remove_emoji)


In [24]:
subset_train_data['comment_text_no_emo'] = subset_train_data['comment_text_no_emo'].swifter.apply(remove_emoticons)


With emoji and emoticons removed, we can remove punctuation, any remaining special characters, and convert to lowercase as we split our comments into individual words. I'm removing stop words on my subset of data for testing.

In [25]:
# Create function to strip punctuation
def strip_punctuation(s):
    return ''.join(c for c in s if c not in punctuation)

In [26]:
subset_train_data['comment_text_punct'] = subset_train_data['comment_text_no_emo'].swifter.apply(lambda x: strip_punctuation(x))

In [27]:
subset_train_data['comment_text_punct'].tail()

1495173    What a DISGRACE\nHERE THIS WILL CHEER YOU ALL ...
980787                       June skies welcome home Hokulea
1763540    They do have some famous scientists and even m...
1496259    So far priests should listen to their people a...
669486     So  has Wynne become a climate change denier  ...
Name: comment_text_punct, dtype: object

In [28]:
subset_train_data.comment_text_punct.head()

339423    Keep up the good work Mr Putin You mean the KG...
511599                                    Youre so straight
425109    Are you making a statement or asking a questio...
476594    Highly doubt it Conservatives dont need outsid...
378440    Most people dont drink to get drunk  the only ...
Name: comment_text_punct, dtype: object
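
The output above shows a side effect worth noting: stripping apostrophes turns contractions such as "don't" into "dont". The sketch below reproduces that on a made-up sentence using `str.translate`, a standard-library alternative to the character-by-character join. `strip_punctuation_fast` is a hypothetical name, and whether it is actually faster on this dataset has not been measured here.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation_fast(text):
    """Remove ASCII punctuation in a single pass over the string."""
    return text.translate(PUNCT_TABLE)

sample = "Highly doubt it. Conservatives don't need outside help!"
print(strip_punctuation_fast(sample))
```

Like the join-based version, this only removes the characters in `string.punctuation`, so typographic quotes and other non-ASCII symbols are left for the later special-character step.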

In [29]:
#convert to lower case
subset_train_data['comment_text_lower'] = subset_train_data['comment_text_punct'].swifter.apply(lambda x: x.lower())


In [30]:
# Function to find and remove any remaining non-alphanumeric characters
def remove_special_characters(text, remove_digits=False):
    # note: A-Z (not A-z), otherwise characters such as [ \ ] ^ _ ` slip through
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text

In [31]:
#remove non-alphanumeric characters remaining e.g. â€
subset_train_data['comment_text_lower'] = subset_train_data['comment_text_lower'].swifter.apply(lambda x: remove_special_characters(x))


In [32]:
#split sentences into words
subset_train_data['comment_text_words'] = subset_train_data['comment_text_lower'].swifter.apply(lambda x: word_tokenize(x))


In [33]:
subset_train_data['comment_text_words'].head()

339423    [keep, up, the, good, work, mr, putin, you, me...
511599                                [youre, so, straight]
425109    [are, you, making, a, statement, or, asking, a...
476594    [highly, doubt, it, conservatives, dont, need,...
378440    [most, people, dont, drink, to, get, drunk, th...
Name: comment_text_words, dtype: object

In [34]:
# function to remove stop words
stop_words = set(stopwords.words('english'))

def remove_stopwords(list_of_words):
    words = [w for w in list_of_words if w not in stop_words]
    return words
    

In [35]:
# remove stop words 
subset_train_data['comment_text_words'] = subset_train_data['comment_text_words'].swifter.apply(lambda x: remove_stopwords(x))


In [36]:
subset_train_data['comment_text_words'].head()

339423    [keep, good, work, mr, putin, mean, kgb, man, ...
511599                                    [youre, straight]
425109    [making, statement, asking, question, poor, gr...
476594    [highly, doubt, conservatives, dont, need, out...
378440    [people, dont, drink, get, drunk, reason, smok...
Name: comment_text_words, dtype: object

A by-product of removing the stop words is inadvertently changing the context of some of the comments. For example, "...no one wants them to in the country anymore" changes to "'one', 'wants', 'country', 'anymore'". The context has changed slightly. In the normalize_corpus function further below, stop-word removal defaults to False for now.
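
One way to limit this is to keep negation words out of the stop-word set, as the `stopword_list` built in the setup cell does by removing 'no' and 'not'. A minimal sketch under that assumption; `BASE_STOP_WORDS` is a tiny hand-written list for illustration (a real run would start from a full list such as NLTK's), and `remove_stopwords_keep_negation` is a hypothetical helper:

```python
# Tiny hand-written stop-word list for illustration only
BASE_STOP_WORDS = {"no", "not", "them", "to", "in", "the"}

# Negation words to keep because they change the meaning of a comment
NEGATIONS = {"no", "not"}
STOP_WORDS = BASE_STOP_WORDS - NEGATIONS

def remove_stopwords_keep_negation(words):
    """Drop stop words from a list of tokens while keeping negations."""
    return [w for w in words if w not in STOP_WORDS]

tokens = "no one wants them to in the country anymore".split()
print(remove_stopwords_keep_negation(tokens))
```

With negations preserved, the example keeps 'no' in front of 'one', so the comment's meaning is less likely to flip.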

In [37]:
def lemmatize_text(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

# Applying cleaning to full data set

For ease, I'm creating a function that will apply each step of cleaning to my original dataset.

In [40]:
def normalize_corpus(corpus, emoji_removal=True, emoticon_removal=True, 
                     punctuation_removal=True, text_lower_case=True, special_char_removal=True, 
                     lemmatize=False, stopword_removal=False, remove_digits=False):
    
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:
        # remove emoji
        if emoji_removal:
            doc = remove_emoji(doc)
        # remove emoticon
        if emoticon_removal:
            doc = remove_emoticons(doc)
        # remove punctuation
        if punctuation_removal:
            doc = strip_punctuation(doc)
        # lowercase the text    
        if text_lower_case:
            doc = doc.lower()
        # collapse newlines and carriage returns into a single space
        doc = re.sub(r'[\r\n]+', ' ', doc)
        # lemmatize text (off by default)
        if lemmatize:
            doc = lemmatize_text(doc)
        # remove special characters and/or digits
        if special_char_removal:
            # insert spaces between special characters to isolate them
            special_char_pattern = re.compile(r'([{.(-)!}])')
            doc = special_char_pattern.sub(" \\1 ", doc)
            doc = remove_special_characters(doc, remove_digits=remove_digits)
        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        # remove stopwords (remove_stopwords expects a list of words)
        if stopword_removal:
            doc = ' '.join(remove_stopwords(doc.split()))
            
        normalized_corpus.append(doc)
        
    return normalized_corpus

In [41]:
# apply function normalize_corpus to full comment_text data

train_data['clean_text'] = normalize_corpus(train_data['comment_text'])

In [42]:
train_data['clean_text'].head()

0    this is so cool its like would you want your m...
1    thank you this would make my life a lot less a...
2    this is such an urgent design problem kudos to...
3    is this something ill be able to install on my...
4                  haha you guys are a bunch of losers
Name: clean_text, dtype: object

In [43]:
# create CSV with cleaned text column - 1.25 GB output

# train_data.to_csv('cleaned_train_data.csv')

In [None]:
# train_data.head().to_csv('test1.csv')

In [None]:
# subset_train_data.head().to_csv('test.csv')

I decided not to use lemmatization or POS tagging on my cleaned data. If I end up wanting to apply lemmatization or POS tagging, I will do so during or after exploratory data analysis (EDA). Tokenization will also happen during EDA.

In [44]:
# Create pickle file for use in the future

train_data.to_pickle(path.join(dst, 'cleaned_train_data.pkl'))