The first step in completing your capstone project is to collect data. Depending on your dataset, you may apply some of the data wrangling techniques that you learned in this unit. Some of you may be using standard datasets and sources, such as Kaggle or Yelp, where minimal or no data wrangling is required. Students often find that this part of the project takes a lot longer than they estimated, which is completely normal. The more work you put in, the more you’ll learn. Data wrangling is an important tool in a data scientist’s toolbox!  


Steps:

Create a Google Doc (1-2 pages) describing the data wrangling steps you took to clean the dataset. Include answers to these questions in your submission:

What kind of cleaning steps did you perform?

How did you deal with missing values, if any?

Were there outliers, and how did you handle them?

Submit a link to the document.

Discuss it with your mentor at the next call.

Revise and resubmit if needed.

Convert the final document to a .pdf and add it to your GitHub repository for this project. This document will eventually become part of your milestone report.

# Intro to data wrangling cell

In [1]:
#!pip install kaggle

In [2]:
import pandas as pd
import numpy as np
import json
import os
from os import path
import shutil
from zipfile import ZipFile

from kaggle.api.kaggle_api_extended import KaggleApi

In [3]:
# Download the competition files through the Kaggle API
# (api.authenticate() needs Kaggle API credentials to be configured)
api = KaggleApi()
api.authenticate()
files = api.competition_download_files("jigsaw-unintended-bias-in-toxicity-classification")

In [4]:
src = "C:\\Users\\ellen\\Documents\\GitHub\\Data_Science_Career_Track\\Capstone_1\\Code\\"
dst = "C:\\Users\\ellen\\Documents\\GitHub\\Data_Science_Career_Track\\Capstone_1\\Data\\"

files = [i for i in os.listdir(src) if i.startswith("jigsaw") and path.isfile(path.join(src, i))]
for f in files:
    shutil.move(path.join(src, f), dst)
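
The move step above can also be sketched without hard-coded Windows paths. This is a minimal illustration rather than the notebook's own code: the helper name `move_matching_files` and the demo file names are hypothetical, and the demo runs in throwaway temporary directories.

```python
import shutil
import tempfile
from pathlib import Path


def move_matching_files(src_dir, dst_dir, prefix):
    """Move regular files in src_dir whose names start with prefix into dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    moved = []
    for entry in sorted(src_dir.iterdir()):
        if entry.is_file() and entry.name.startswith(prefix):
            shutil.move(str(entry), str(dst_dir / entry.name))
            moved.append(entry.name)
    return moved


# Demonstration on temporary directories (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    demo_src = Path(tmp) / "Code"
    demo_dst = Path(tmp) / "Data"
    demo_src.mkdir()
    (demo_src / "jigsaw-demo.zip").write_bytes(b"")
    (demo_src / "notebook.ipynb").write_text("{}")

    moved_names = move_matching_files(demo_src, demo_dst, "jigsaw")
    remaining_names = sorted(p.name for p in demo_src.iterdir())
    print(moved_names, remaining_names)
```

Building paths with `pathlib` avoids manual separator handling, so the same cell should work on other machines once the two directory arguments are adjusted.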

In [12]:
zipfile_name = 'jigsaw-unintended-bias-in-toxicity-classification.zip'
with ZipFile(path.join(dst, zipfile_name), 'r') as zipObj:
    # List the archive contents, then extract only train.csv into dst
    print(zipObj.namelist())
    zipObj.extract('train.csv', path=dst)

['identity_individual_annotations.csv', 'sample_submission.csv', 'test.csv', 'test_private_expanded.csv', 'test_public_expanded.csv', 'toxicity_individual_annotations.csv', 'train.csv']
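
The same extraction pattern can be tried on a small archive built on the fly. This is only a sketch: `extract_member` is a hypothetical helper, and the archive contents are made-up stand-ins for the real competition files.

```python
import tempfile
from pathlib import Path
from zipfile import ZipFile


def extract_member(zip_path, member, out_dir):
    """Extract one named member from the archive and return the extracted path."""
    with ZipFile(zip_path, "r") as archive:
        if member not in archive.namelist():
            raise KeyError(f"{member!r} not found in {zip_path}")
        return Path(archive.extract(member, path=out_dir))


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny demo archive with two members (made-up contents).
    demo_zip = Path(tmp) / "demo.zip"
    with ZipFile(demo_zip, "w") as archive:
        archive.writestr("train.csv", "id,target\n1,0.0\n")
        archive.writestr("test.csv", "id\n2\n")

    # Extract only train.csv; test.csv stays inside the archive.
    extracted = extract_member(demo_zip, "train.csv", Path(tmp) / "out")
    extracted_text = extracted.read_text()
    extracted_names = sorted(p.name for p in (Path(tmp) / "out").iterdir())
    print(extracted_names)
```

Checking `namelist()` before extracting gives a clearer error if the expected file is absent from the archive.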


In [13]:
# Read in the train dataset
csv_filename = 'train.csv'
train_data = pd.read_csv(dst + csv_filename, low_memory=False)

# Output the number of rows
print("Total rows: {0}".format(len(train_data)))

# See which headers are available
print(list(train_data))

Total rows: 1804874
['id', 'target', 'comment_text', 'severe_toxicity', 'obscene', 'identity_attack', 'insult', 'threat', 'asian', 'atheist', 'bisexual', 'black', 'buddhist', 'christian', 'female', 'heterosexual', 'hindu', 'homosexual_gay_or_lesbian', 'intellectual_or_learning_disability', 'jewish', 'latino', 'male', 'muslim', 'other_disability', 'other_gender', 'other_race_or_ethnicity', 'other_religion', 'other_sexual_orientation', 'physical_disability', 'psychiatric_or_mental_illness', 'transgender', 'white', 'created_date', 'publication_id', 'parent_id', 'article_id', 'rating', 'funny', 'wow', 'sad', 'likes', 'disagree', 'sexual_explicit', 'identity_annotator_count', 'toxicity_annotator_count']


The metadata from Civil Comments is contained in the following columns: created_date, publication_id, parent_id, article_id, rating, funny, wow, sad, likes, and disagree.

Because we're not using Civil Comments' rating or tags, these columns are removed before checking data types and missing values.

In [21]:
slimmed_train_data = train_data.drop(['created_date', 'publication_id', 'parent_id', 'article_id', 'rating', 'funny', 'wow', 'sad', 'likes', 'disagree'], axis=1)
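
As an alternative, assuming memory is a concern with a file of this size, the metadata columns could be skipped when the CSV is read, so they are never loaded. The sketch below uses a small in-memory CSV with made-up rows in place of train.csv; `METADATA_COLUMNS` and `slim_demo` are names introduced here for illustration.

```python
import io

import pandas as pd

# Civil Comments metadata columns listed above.
METADATA_COLUMNS = {
    "created_date", "publication_id", "parent_id", "article_id",
    "rating", "funny", "wow", "sad", "likes", "disagree",
}

# Made-up stand-in for train.csv with a few of the real column names.
demo_csv = io.StringIO(
    "id,target,comment_text,rating,likes\n"
    "1,0.0,hello,approved,3\n"
    "2,0.8,goodbye,rejected,0\n"
)

# A callable passed to usecols keeps only the columns for which it returns True.
slim_demo = pd.read_csv(demo_csv, usecols=lambda name: name not in METADATA_COLUMNS)
print(list(slim_demo.columns))
```

Dropping after loading, as in the cell above, produces the same columns; filtering at read time only changes how much is held in memory.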

In [40]:
percent_missing = slimmed_train_data.isnull().sum() * 100 / len(slimmed_train_data)
print(round(percent_missing,1))

id                                      0.0
target                                  0.0
comment_text                            0.0
severe_toxicity                         0.0
obscene                                 0.0
identity_attack                         0.0
insult                                  0.0
threat                                  0.0
asian                                  77.6
atheist                                77.6
bisexual                               77.6
black                                  77.6
buddhist                               77.6
christian                              77.6
female                                 77.6
heterosexual                           77.6
hindu                                  77.6
homosexual_gay_or_lesbian              77.6
intellectual_or_learning_disability    77.6
jewish                                 77.6
latino                                 77.6
male                                   77.6
muslim                          

In [37]:
# Function for counting null records in a column
def num_missing(x):
    return x.isnull().sum()

# Apply per column (axis=0 applies the function to each column)
print("Missing values per column:")
print(slimmed_train_data.apply(num_missing, axis=0))

Missing values per column:
id                                           0
target                                       0
comment_text                                 0
severe_toxicity                              0
obscene                                      0
identity_attack                              0
insult                                       0
threat                                       0
asian                                  1399744
atheist                                1399744
bisexual                               1399744
black                                  1399744
buddhist                               1399744
christian                              1399744
female                                 1399744
heterosexual                           1399744
hindu                                  1399744
homosexual_gay_or_lesbian              1399744
intellectual_or_learning_disability    1399744
jewish                                 1399744
latino                           

In [26]:
slimmed_train_data.dtypes

id                                       int64
target                                 float64
comment_text                            object
severe_toxicity                        float64
obscene                                float64
identity_attack                        float64
insult                                 float64
threat                                 float64
asian                                  float64
atheist                                float64
bisexual                               float64
black                                  float64
buddhist                               float64
christian                              float64
female                                 float64
heterosexual                           float64
hindu                                  float64
homosexual_gay_or_lesbian              float64
intellectual_or_learning_disability    float64
jewish                                 float64
latino                                 float64
male         

In [31]:
# Draft cell, left commented out: inspect the distinct values of one identity column
# slimmed_train_data.asian.unique()

Toxicity and identity labels range from 0.0 to 1.0; each value is the fraction of raters who believed the label fit the comment. Toxicity labels have no missing values. According to the competition details, only a subset of comments has been labelled with the identity attributes mentioned in the comment, so every identity column is missing ~78% of its values. The labelled subset therefore comprises approximately 22% of the data (405,130 of 1,804,874 rows).
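
The statements in the paragraph above can be checked programmatically. The sketch below runs on a small made-up frame, not the real data, and uses only two of the identity columns. It tests whether all labels lie in [0, 1], whether the identity columns are missing together row by row, and what share of rows carries identity labels.

```python
import numpy as np
import pandas as pd

# Made-up frame mimicking the structure: toxicity is always present,
# identity labels exist only for a subset of rows.
demo = pd.DataFrame({
    "target": [0.0, 0.5, 0.2, 1.0, 0.0],
    "asian":  [np.nan, 0.0, np.nan, 0.25, np.nan],
    "female": [np.nan, 1.0, np.nan, 0.0, np.nan],
})
identity_cols = ["asian", "female"]

# Every non-missing label should be a fraction between 0 and 1.
values = demo[["target"] + identity_cols]
in_range = bool((((values >= 0) & (values <= 1)) | values.isnull()).all().all())

# Count how many identity labels each row has: 0 (unlabelled) or all of them.
labelled_per_row = demo[identity_cols].notnull().sum(axis=1)
all_or_nothing = bool(labelled_per_row.isin([0, len(identity_cols)]).all())

# Share of rows that carry identity labels.
labelled_share = float((labelled_per_row > 0).mean())
print(in_range, all_or_nothing, labelled_share)
```

On the real data, the same three checks would be run against `slimmed_train_data` with the full list of identity columns.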

 

For example (from Kaggle's competition details), 