# Data cleaning (Michael)

Remove URLs, line breaks, HTML tags, numbers, and repeated spaces from the comment text.

## Setup

In [81]:
# import the usual suspects / basics
import time
full_run_time_start = time.time()  # start timing execution right away
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import re

# scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (roc_auc_score, classification_report, f1_score,
                             accuracy_score, precision_score, recall_score,
                             confusion_matrix)

# XGBoost
from xgboost import XGBClassifier

# currently not used and thus commented out
# import nltk
# nltk.download('wordnet')
# nltk.download('omw-1.4')

# display all df columns (default is 20)
pd.options.display.max_columns = None

## Load data

In [82]:
df = pd.read_csv('data/undersampled_data_60_40.csv')
df.shape

(360835, 6)

## Optional: Create smaller sample from data to speed up experiments

In [83]:
sample_size = None

# uncomment to create sample of desired size
#sample_size = 25_000

if sample_size is not None:
    # ratio toxic/nontoxic
    tox_perc = 0.4
    nontox_perc = 0.6

    # number of toxic/nontoxic rows
    sample_size_tox = int(sample_size * tox_perc)
    sample_size_nontox = int(sample_size * nontox_perc)

    sample_tox = df[df['toxic'] == 1].sample(sample_size_tox,
                                             random_state=42)
    sample_nontox = df[df['toxic'] == 0].sample(sample_size_nontox,
                                                random_state=42)

    df = pd.concat([sample_tox, sample_nontox])
    print(f'Using sample ({df.shape[0]} rows).')

else:
    print(f'Using full data ({df.shape[0]} rows).')

Using full data (360835 rows).


## Data cleaning (URLs etc.)

In [84]:
# show full comment text instead of truncating long cells
pd.options.display.max_colwidth = None

corp_raw = df['comment_text']

### Remove URLs

In [85]:
regex = r'https?://\S+'
print(corp_raw.str.count(regex, flags=re.IGNORECASE).sum())
corp_raw[corp_raw.str.contains(regex, na=False, case=False)].head()

9760


3                      We are already owed $488 M plus interest($2Billion) from 2006 audits the state has not collected.\nhttps://www.adn.com/energy/article/oil-audit-draft/2014/11/20/\n\nThis amount of interest doesn't seem correct...\n\n'$416 million in taxes, plus another $368 million in interest between 2007 and 2009'\n\nWhen oil companies sued the state they wanted $100 M plus $400 M interest from 2006.\nhttps://www.adn.com/business-economy/energy/2016/12/16/state-wins-case-against-oil-companies-worth-an-estimated-500-million/\n\nIs the state interest rate is much lower than the one oil companies set for us, or the legislature is letting them off with only 3 years of interest?\n\n "The new law includes the unbelievable provision that after three years the companies will pay zero additional interest on delinquent taxes."\nhttps://www.adn.com/opinions/2016/11/29/with-pfd-cut-on-the-line-oil-company-arguments-about-fine-points-of-tax-regs-will-backfire/

In [86]:
corp_raw = corp_raw.str.replace(regex, '', regex=True, case=False)
print(corp_raw.str.count(regex, flags=re.IGNORECASE).sum())

0
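The pattern above only matches links that start with `http://` or `https://`, so bare `www.` links stay in the text. A minimal sketch of the difference follows; the sample comments and the `with_www` pattern are assumptions for illustration, not taken from the dataset or the original notebook.

```python
import re
import pandas as pd

# Hypothetical sample comments, not taken from the dataset
sample = pd.Series([
    "see https://example.com/page for details",
    "see www.example.com/page for details",
])

scheme_only = r'https?://\S+'          # pattern used in the cell above
with_www = r'(?:https?://|www\.)\S+'   # assumed extension: also catches bare www. links

# Number of matches per comment for each pattern
print(sample.str.count(scheme_only, flags=re.IGNORECASE).tolist())
print(sample.str.count(with_www, flags=re.IGNORECASE).tolist())
```

Whether the broader pattern is worth using depends on how often scheme-less links occur in the comments, which can be checked with the same `str.count` call as above.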


### Replace line breaks with spaces

In [87]:
regex = r'\n'
print(corp_raw.str.count(regex, flags=re.IGNORECASE).sum())
corp_raw[corp_raw.str.contains(regex, na=False, case=False)].head()

392196


1                                                                                                                                                                                                                                                                                                                                                                           The moment of critical mass is approaching when the deeds of Gupta & Co, like huge turbine engines slow down, halt and the reverse direction of the wheels of justice are set in motion leaving no hiding room.\n\n‘...unintended consequences’…. uneasy sleep ahead for many.

In [88]:
corp_raw = corp_raw.str.replace(regex, ' ', regex=True, case=False)
print(corp_raw.str.count(regex, flags=re.IGNORECASE).sum())

0


### Remove HTML tags

In [89]:
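# note: \S matches exactly one character, so only single-character
# tags such as <b> or <i> are caught (not <br>, <em>, or tags with attributes)
regex = r'</?\S>'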
print(corp_raw.str.count(regex, flags=re.IGNORECASE).sum())
corp_raw[corp_raw.str.contains(regex, na=False, case=False)].head()

123


5268     430edward You wrote:  "The solution to the second question is for you to write a letter to the person you harmed or do a graveside amends."  But, what i wrote was:  "Or, what <b>if you are the hurt person</b> and by the time you come to the point of knowing to offer forgiveness <b>you can not because the person who hurt you is dead.""</b>  Lastly, you wrote:  "The solution to the first question is that it is not your business how the other takes your forgiveness, only that it is sincere."  However, when you forgive someone, you hope there will be <b>Reconciliation</b> so it is the business of the one who forgives.  Forgiveness without Reconciliation leaves a hollow spot in one's heart that never gets filled in.

In [90]:
corp_raw = corp_raw.str.replace(regex, '', regex=True, case=False)
print(corp_raw.str.count(regex, flags=re.IGNORECASE).sum())

0
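Because `\S` matches a single character, the pattern above leaves longer tags and tags with attributes in place. The sketch below compares it with a broader pattern; `any_tag` and the sample string are assumptions for illustration, and the broader pattern can also remove non-HTML text such as `a<b and c>d`, so it should be checked against the data before use.

```python
import re

single_char_tag = r'</?\S>'     # pattern used in the cell above
any_tag = r'</?[a-z][^>]*>'     # assumed broader alternative, not from the notebook

# Hypothetical comment containing several kinds of tags
text = 'a <b>bold</b> word<br/>and a <a href="x">link</a>'

narrow = re.sub(single_char_tag, '', text, flags=re.IGNORECASE)
broad = re.sub(any_tag, '', text, flags=re.IGNORECASE)

print(narrow)  # <br/> and the opening <a ...> tag survive
print(broad)   # all tags removed
```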


### Remove numbers

In [91]:
regex = r'\d+'
print(corp_raw.str.count(regex, flags=re.IGNORECASE).sum())
corp_raw[corp_raw.str.contains(regex, na=False, case=False)].head()

168642


3     We are already owed $488 M plus interest($2Billion) from 2006 audits the state has not collected.   This amount of interest doesn't seem correct...  '$416 million in taxes, plus another $368 million in interest between 2007 and 2009'  When oil companies sued the state they wanted $100 M plus $400 M interest from 2006.   Is the state interest rate is much lower than the one oil companies set for us, or the legislature is letting them off with only 3 years of interest?   "The new law includes the unbelievable provision that after three years the companies will pay zero additional interest on delinquent taxes." 
9                                                                                                                                                                                                                                                                                                        Why leave my basement? It's 1800 square feet with full bar stocked with Guinness

In [92]:
corp_raw = corp_raw.str.replace(regex, '', regex=True, case=False)
print(corp_raw.str.count(regex, flags=re.IGNORECASE).sum())

0
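Deleting digits outright leaves fragments behind: `$488 M` becomes `$ M`. One possible alternative, shown here only as a sketch, is to replace each number (with an optional leading `$`) by a placeholder token. The token name `NUM` and the sample sentence are assumptions, not part of the original notebook.

```python
import re

# Hypothetical sentence modelled on the comments shown above
text = "We are owed $488 M from 2006 audits."

# Behaviour of the step above: digits are deleted
removed = re.sub(r'\d+', '', text)

# Assumed alternative: swap each number for a placeholder token,
# then collapse the extra spaces this introduces
tagged = re.sub(r'\$?\d+', ' NUM ', text)
tagged = re.sub(r' {2,}', ' ', tagged).strip()

print(removed)
print(tagged)
```

A placeholder keeps the information that a number was present, which a bag-of-words vectorizer can then count as an ordinary token.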


### Collapse multiple spaces

In [93]:
regex = r' {2,}'
print(corp_raw.str.count(regex, flags=re.IGNORECASE).sum())
corp_raw[corp_raw.str.contains(regex, na=False, case=False)].head()

551214


1                                                                                                                                                                                                                                                                                                                               The moment of critical mass is approaching when the deeds of Gupta & Co, like huge turbine engines slow down, halt and the reverse direction of the wheels of justice are set in motion leaving no hiding room.  ‘...unintended consequences’…. uneasy sleep ahead for many.
2                                                                                                                                                                                                                                                                                                                                                                                                     "Hey listen to me," h

In [94]:
corp_raw = corp_raw.str.replace(regex, ' ', regex=True, case=False)
print(corp_raw.str.count(regex, flags=re.IGNORECASE).sum())

0
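The five steps above share the same `str.replace` call and differ only in pattern and replacement, so they can be collected into one helper. This is a minimal sketch; the names `CLEANING_STEPS` and `clean_corpus` and the demo string are hypothetical and do not appear in the original notebook.

```python
import pandas as pd

# (pattern, replacement) pairs in the order applied above
CLEANING_STEPS = [
    (r'https?://\S+', ''),   # URLs
    (r'\n', ' '),            # line breaks
    (r'</?\S>', ''),         # single-character HTML tags
    (r'\d+', ''),            # numbers
    (r' {2,}', ' '),         # runs of spaces
]

def clean_corpus(corpus: pd.Series) -> pd.Series:
    """Apply every cleaning step in order and return the cleaned Series."""
    for pattern, replacement in CLEANING_STEPS:
        corpus = corpus.str.replace(pattern, replacement, regex=True, case=False)
    return corpus

# Hypothetical demo comment, not taken from the dataset
demo = pd.Series(["Read <b>this</b>:\nhttps://example.com  now, 42 times"])
print(clean_corpus(demo).iloc[0])
```

The order matters: the space-collapsing step runs last because each earlier step can leave new runs of spaces behind.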


In [95]:
corp_raw.sample(10)

339899                                                                                                                                                                                                                                                                                                         This ridiculous argument appeared on the new HART website recently. The Tax Foundation and independent CPAs have long established that this rail GET surcharge tax costs each individual on Oahu between $ and $ per year - for years. That is about $, for each couple. Enough - no more taxes for rail.