# Data cleaning FINAL (Michael)

## Setup

In [1]:
# import the usual suspects / basics
import pandas as pd
import numpy as np
import re
import pickle
import os

# tqdm
from tqdm import tqdm
tqdm.pandas()

# spaCy
import spacy
#!python -m spacy download en_core_web_sm # must be run just once

# fastText
import fasttext

# display all df columns (default is 20)
pd.options.display.max_columns = None

# show all data in columns so that full comment is visible
pd.options.display.max_colwidth = None

## Load data

In [2]:
df = pd.read_csv('data/undersampled_data_60_40_ft.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 360301 entries, 0 to 360300
Data columns (total 7 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   Unnamed: 0                360301 non-null  int64 
 1   comment_text              360301 non-null  object
 2   toxic                     360301 non-null  int64 
 3   stopwords_punct_lemma     360273 non-null  object
 4   toxic_label_ft            360301 non-null  object
 5   toxic_label_comment_text  360301 non-null  object
 6   vector_fast_text          360301 non-null  object
dtypes: int64(2), object(5)
memory usage: 19.2+ MB


## Optional: Create a smaller sample of the data to speed things up while experimenting

In [4]:
sample_size = None

# uncomment to create sample of desired size
#sample_size = 1_000

if sample_size is not None:
    # ratio toxic/nontoxic
    tox_perc = 0.4
    nontox_perc = 0.6

    # number of toxic/nontoxic rows
    sample_size_tox = int(sample_size * tox_perc)
    sample_size_nontox = int(sample_size * nontox_perc)

    sample_tox = df[df['toxic'] == 1].sample(sample_size_tox,
                                             random_state=42)
    sample_nontox = df[df['toxic'] == 0].sample(sample_size_nontox,
                                                random_state=42)

    df = pd.concat([sample_tox, sample_nontox])
    print(f'Using sample ({df.shape[0]} rows).')

else:
    print(f'Using full data ({df.shape[0]} rows).')

Using full data (360301 rows).


## Create corpus

In [5]:
corp = df['comment_text']

## Data cleaning

### Show data size before cleaning

In [6]:
# count 'words' (rough regex method)
num_words_before = corp.str.count(r'\S+', flags=re.I).sum()

print(f'Number of words in corpus before cleaning: {num_words_before:,}')

Number of words in corpus before cleaning: 18,101,931
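
To make the "rough" in the comment above concrete, here is a minimal sketch of what a `\S+` count measures on a single string. The helper name `count_tokens` and the sample sentence are made up for illustration; only the regex comes from the cell above.

```python
import re

def count_tokens(text):
    """Count maximal runs of non-whitespace characters."""
    return len(re.findall(r'\S+', text))

# invented sample: a stray comma, trailing dots and a URL
sample = "Well , that's it... https://example.com"
print(count_tokens(sample))
```

A stray punctuation mark or a URL counts as one "word" each, while punctuation glued to a word does not add to the count, so the totals are only an approximate measure of corpus size.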


### Remove anchor HTML tags (\<a\>)

TODO: Do this with an HTML parser like Beautiful Soup.
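
A possible starting point for this TODO, sketched with the standard library's `html.parser` instead of Beautiful Soup so that it runs without extra dependencies. The names `AnchorStripper` and `strip_anchor_tags` are hypothetical. This is an untested-on-the-corpus sketch with two known side effects: character references such as `&amp;` are decoded, and HTML comments or declarations are dropped because no handler re-emits them.

```python
from html.parser import HTMLParser

class AnchorStripper(HTMLParser):
    """Re-emit the input, skipping only <a ...> and </a> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # tag names arrive lowercased, so <A HREF=...> is caught as well
        if tag != 'a':
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag != 'a':
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != 'a':
            self.parts.append(f'</{tag}>')

    def handle_data(self, data):
        self.parts.append(data)

def strip_anchor_tags(text):
    parser = AnchorStripper()
    parser.feed(text)
    parser.close()
    return ''.join(parser.parts)

print(strip_anchor_tags('Plz visit:- <a href= "http://www.dooiitt.com/">Designer Salwar Kameez</a>'))
```

If adopted, it could be applied with `corp.map(strip_anchor_tags)`; the regex in the next cell remains the method actually used in this notebook.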

In [7]:
regex = r'<a .*?>|</a>' # *? for non-greedy repetition

# count matches
print(corp.str.count(regex, flags=re.I).sum())

# show some rows containing the pattern
corp[corp.str.contains(regex, na=False, case=False)].head()

77


8286                                                                                                                                You can buy from our large and diverse collection of salwar kameez, party wear suits, bollywood collection, cotton kurtis, Anarrkali suits,Bollywood saree and many other products.....\nWe Have Some For You In Your Budget For more…\nPlz visit:- <a href= "http://www.dooiitt.com/">Designer Salwar Kameez</a>
28632                                                                                                                                                   <a href="http://www.newfitnessbooster.com/dermessence/">Dermessence</a> has most essential nutrients that this formula has and that act directly and indirectly in combating again signs from the inside out. for more information please visit http://www.newfitnessbooster.com/dermessence/
32256                                                                                                                       

In [8]:
# replace pattern
corp = corp.str.replace(regex, '', regex=True, case=False)

# count matches again, should be 0
print(corp.str.count(regex, flags=re.I).sum())

0


### Remove URLs

In [9]:
regex = r'https?://\S+'
print(corp.str.count(regex, flags=re.I).sum())
corp[corp.str.contains(regex, na=False, case=False)].head()

9725


3                      We are already owed $488 M plus interest($2Billion) from 2006 audits the state has not collected.\nhttps://www.adn.com/energy/article/oil-audit-draft/2014/11/20/\n\nThis amount of interest doesn't seem correct...\n\n'$416 million in taxes, plus another $368 million in interest between 2007 and 2009'\n\nWhen oil companies sued the state they wanted $100 M plus $400 M interest from 2006.\nhttps://www.adn.com/business-economy/energy/2016/12/16/state-wins-case-against-oil-companies-worth-an-estimated-500-million/\n\nIs the state interest rate is much lower than the one oil companies set for us, or the legislature is letting them off with only 3 years of interest?\n\n "The new law includes the unbelievable provision that after three years the companies will pay zero additional interest on delinquent taxes."\nhttps://www.adn.com/opinions/2016/11/29/with-pfd-cut-on-the-line-oil-company-arguments-about-fine-points-of-tax-regs-will-backfire/
65                            

In [10]:
corp = corp.str.replace(regex, '', regex=True, case=False)
print(corp.str.count(regex, flags=re.I).sum())

0
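
A small sketch of what the URL pattern does and does not catch, run on invented strings rather than on the corpus. The name `URL_RE` is only used here.

```python
import re

URL_RE = re.compile(r'https?://\S+', flags=re.I)

samples = [
    'Source: https://example.com/a?b=1, see page 2',  # trailing comma
    'Visit www.example.org for details',              # no scheme
    '(HTTP://EXAMPLE.COM/path) is shouted',           # upper case, closing bracket
]

cleaned = [URL_RE.sub('', s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```

Because `\S+` runs to the next whitespace, punctuation directly after a URL is removed together with it, and addresses without `http://` or `https://` are left in place. The double spaces that remain are collapsed in a later step.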


### Remove whitespace except for spaces

`\r` causes an error when the saved CSV file is loaded with `read_csv()` (C engine only; the Python engine works).  
`\u2028` is the Unicode line separator.
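
One way to see why these characters are worth replacing: Python's own `str.splitlines()` treats both `\r` and `\u2028` as line boundaries, so a text field containing them can look like several lines to line-oriented tools. The sample string is invented, and the sketch does not reproduce the `read_csv()` error itself.

```python
import re

sample = 'first\rsecond\u2028third\tfourth'

# \r and \u2028 both split the string; \t does not
print(sample.splitlines())

# same character class as in the cell below, applied to one string
cleaned = re.sub(r'[\t\n\r\f\v\u2028]', ' ', sample)
print(cleaned.splitlines())
```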

In [11]:
regex = r'[\t\n\r\f\v\u2028]'
print(corp.str.count(regex, flags=re.I).sum())
corp[corp.str.contains(regex, na=False, case=False)].head()

392656


1                                                                                                                                                                                                                                                                                                                                                                           The moment of critical mass is approaching when the deeds of Gupta & Co, like huge turbine engines slow down, halt and the reverse direction of the wheels of justice are set in motion leaving no hiding room.\n\n‘...unintended consequences’…. uneasy sleep ahead for many.
2                                                                                                                                                                                                                                                                                                                                                                            

In [12]:
corp = corp.str.replace(regex, ' ', regex=True, case=False)
print(corp.str.count(regex, flags=re.I).sum())

0


### Remove numbers

In [13]:
regex = r'\d+'
print(corp.str.count(regex, flags=re.I).sum())
corp[corp.str.contains(regex, na=False, case=False)].head()

168643


3     We are already owed $488 M plus interest($2Billion) from 2006 audits the state has not collected.   This amount of interest doesn't seem correct...  '$416 million in taxes, plus another $368 million in interest between 2007 and 2009'  When oil companies sued the state they wanted $100 M plus $400 M interest from 2006.   Is the state interest rate is much lower than the one oil companies set for us, or the legislature is letting them off with only 3 years of interest?   "The new law includes the unbelievable provision that after three years the companies will pay zero additional interest on delinquent taxes." 
9                                                                                                                                                                                                                                                                                                        Why leave my basement? It's 1800 square feet with full bar stocked with Guinness

In [14]:
corp = corp.str.replace(regex, ' ', regex=True, case=False)
print(corp.str.count(regex, flags=re.I).sum())

0


### Manually "unmask" the most frequent swearwords, insults etc. (e.g. f*ck, cr@p)

Also correct some deliberate misspellings that reflect pronunciation, e.g. "huuuge", "stooopid".

TODO: Implement autocorrection.

In [15]:
# search patterns used to create list of replacements (see next cell)

regex = r'\S*\*\S+'
#regex = r'\S*@\S+'
#regex = r'\S*#\S+'
#regex = r'\S*a{3,}\S*'
#regex = r'\S*e{3,}\S*'
#regex = r'\S*i{3,}\S*'
#regex = r'\S*o{3,}\S*'
#regex = r'\S*u{3,}\S*'

print(corp.str.count(regex, flags=re.I).sum())
all_matches = corp.str.findall(regex, flags=re.I).value_counts()
all_matches[all_matches > 5]

3944


comment_text
[]              357089
[sh*t]              77
[***]               49
[a**]               44
[****]              36
[*****]             32
[s**t]              32
[f***]              27
[p***y]             25
[f**k]              24
[p****]             19
[p*ssy]             19
[s***]              17
[a**.]              17
[F***]              16
[h*ll]              16
[*is*]              14
[sh*t.]             12
[h***]              12
[*any*]             12
[*not*]             11
[sh**]              11
[pu**y]             11
[cr*p]              11
[F*ck]              10
[f***ing]           10
[**]                10
[*sigh*]            10
[***, ***]           9
[****, ****]         9
[*are*]              9
[s**t.]              9
[*&^%]               9
[*some*]             8
[a**es]              8
[b*tch]              8
[*only*]             8
[*ss]                8
[*you*]              8
[*lol*]              8
[*could*]            8
[f*ck]               8
[*did*]              

In [16]:
# raw string so that \b stays a regex word boundary; the boundaries keep
# 'soo+' from rewriting the 'soo' inside words such as 'soon'
match_list = r'(?i)f*ck, (?i)sh*t, (?i)s**t, (?i)f***, (?i)p***y, (?i)b*tch, (?i)f**k, (?i)p*ssy, (?i)p****, (?i)s***, (?i)a**, (?i)h*ll, (?i)h***, (?i)sh*t, (?i)pu**y, (?i)sh**, (?i)cr*p, (?i)@ss, (?i)cr@p, (?i)b@lls, (?i)f@ck, (?i)waaay, (?i)waaaay, (?i)riiiight, (?i)\bsoo+\b, (?i)stooooopid, (?i)huu+ge, (?i)yuu+ge, (?i)suu+re'\
    .replace('*', r'\*').split(', ')
replace_list = 'fuck, shit, shit, fuck, pussy, bitch, fuck, pussy, pussy, shit, ass, hell, hell, shit, pussy, shit, crap, ass, crap, balls, fuck, way, way, right, so, stupid, huge, huge, sure'\
    .split(', ')

corp.replace(match_list, replace_list, regex=True, inplace=True)
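
The same idea can be sketched for a single string with plain `re`, which makes the two kinds of rules easier to tell apart: masked literals are escaped with `re.escape`, and elongation patterns are wrapped in `\b` so that they only match whole words. The dictionaries below hold just a few illustrative entries, and the names `MASKED`, `ELONGATED`, `compile_rules` and `unmask` are hypothetical.

```python
import re

# masked spelling -> word (literal text, so it gets escaped)
MASKED = {'sh*t': 'shit', 'cr@p': 'crap', 'h*ll': 'hell'}
# regex for an elongated spelling -> word
ELONGATED = {r'soo+': 'so', r'huu+ge': 'huge'}

def compile_rules(masked, elongated):
    """Build (compiled pattern, replacement) pairs, case-insensitive."""
    rules = [(re.compile(re.escape(m), re.I), w) for m, w in masked.items()]
    # \b on both sides: 'soo+' must not match the 'soo' inside 'soon'
    rules += [(re.compile(rf'\b{p}\b', re.I), w) for p, w in elongated.items()]
    return rules

def unmask(text, rules):
    for pattern, word in rules:
        text = pattern.sub(word, text)
    return text

RULES = compile_rules(MASKED, ELONGATED)
print(unmask('Soooo much cr@p, see you soon. Huuuge!', RULES))
```

As in the cell above, replacements are always lower case, so the original capitalisation of a matched word is lost.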

### Remove multiple spaces

In [17]:
regex = r' {2,}'
print(corp.str.count(regex, flags=re.I).sum())
corp[corp.str.contains(regex, na=False, case=False)].head()

634227


1                                                                                                                                                                                                                                                                                                                                          The moment of critical mass is approaching when the deeds of Gupta & Co, like huge turbine engines slow down, halt and the reverse direction of the wheels of justice are set in motion leaving no hiding room.  ‘...unintended consequences’…. uneasy sleep ahead for many.
2                                                                                                                                                                                                                                                                                                                                                                                                               

In [18]:
corp = corp.str.replace(regex, ' ', regex=True, case=False)
print(corp.str.count(regex, flags=re.I).sum())

0
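
For reference, the regex steps of this section can be chained into one function that works on a single string. This is a sketch, not the code that produced the results in this notebook: the names `STEPS` and `clean_text` are made up, the sample text is invented, and the manual unmasking step is left out.

```python
import re

# (pattern, replacement) pairs in the order used in this notebook
STEPS = [
    (re.compile(r'<a .*?>|</a>', re.I), ''),       # anchor tags
    (re.compile(r'https?://\S+', re.I), ''),       # URLs
    (re.compile(r'[\t\n\r\f\v\u2028]'), ' '),      # whitespace except spaces
    (re.compile(r'\d+'), ' '),                     # numbers
    (re.compile(r' {2,}'), ' '),                   # multiple spaces
]

def clean_text(text):
    for pattern, repl in STEPS:
        text = pattern.sub(repl, text)
    return text

sample = 'Visit <a href="http://x.y/">shop</a>\nfor 20 deals: https://x.y/z now'
print(clean_text(sample))
```

The order matters: if URLs were removed before anchor tags, `\S+` would swallow everything from `http://x.y/` up to and including `</a>` and leave a broken `<a href="` behind. Collapsing spaces has to come last because the earlier steps create runs of spaces.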


### Show data size after cleaning

In [19]:
num_words_after = corp.str.count(r'\S+', flags=re.I).sum()

print(f'Number of words in corpus after cleaning: {num_words_after:,} (before: {num_words_before:,})')

Number of words in corpus after cleaning: 18,059,436 (before: 18,101,931)


## Preprocess data with spaCy (based on Eric's pipeline)

See: https://realpython.com/natural-language-processing-spacy-python/

TODO: Check if NLTK is faster.

In [20]:
# load English language model
nlp = spacy.load('en_core_web_sm')

### Tokenize, remove punctuation, make lower case, lemmatize, remove stop words

In [21]:
def preprocess(s):
    """Return lowercased tokens, lemmas, and lemmas without stop words
    for string `s`, each as one space-separated string."""
    doc = nlp(s)

    # drop punctuation once; all three variants start from these tokens
    kept = [token for token in doc if not token.is_punct]

    tokens = [token.text.lower() for token in kept]
    tokens_lemma = [token.lemma_.lower() for token in kept]
    tokens_lemma_stop = [token.lemma_.lower()
                         for token in kept
                         if not token.is_stop]

    # convert lists to space-separated strings and return as Series
    return pd.Series([' '.join(tokens),
                      ' '.join(tokens_lemma),
                      ' '.join(tokens_lemma_stop)],
                     index=['clean_pp',
                            'clean_pp_lemma',
                            'clean_pp_lemma_stop'])

In [22]:
corp_pp = corp.progress_apply(preprocess)
corp_pp.head()

100%|██████████| 360301/360301 [1:05:28<00:00, 91.71it/s] 


Unnamed: 0,clean_pp,clean_pp_lemma,clean_pp_lemma_stop
0,well what are the chances he will turn out to have been an active proponent of slavery,well what be the chance he will turn out to have be an active proponent of slavery,chance turn active proponent slavery
1,the moment of critical mass is approaching when the deeds of gupta co like huge turbine engines slow down halt and the reverse direction of the wheels of justice are set in motion leaving no hiding room unintended consequences uneasy sleep ahead for many,the moment of critical mass be approach when the deed of gupta co like huge turbine engine slow down halt and the reverse direction of the wheel of justice be set in motion leave no hiding room unintended consequence uneasy sleep ahead for many,moment critical mass approach deed gupta co like huge turbine engine slow halt reverse direction wheel justice set motion leave hiding room unintended consequence uneasy sleep ahead
2,hey listen to me he said i 'm not going to put up with your crap about all this he should n't have to prove himself to a reporter he said uh actually ben you do and you did n't buh bye,hey listen to i he say i be not go to put up with your crap about all this he should not have to prove himself to a reporter he say uh actually ben you do and you do not buh bye,hey listen say go crap prove reporter say uh actually ben buh bye
3,we are already owed $ m plus interest($ billion from audits the state has not collected this amount of interest does n't seem correct $ million in taxes plus another $ million in interest between and when oil companies sued the state they wanted $ m plus $ m interest from is the state interest rate is much lower than the one oil companies set for us or the legislature is letting them off with only years of interest the new law includes the unbelievable provision that after three years the companies will pay zero additional interest on delinquent taxes,we be already owe $ m plus interest($ billion from audits the state have not collect this amount of interest do not seem correct $ million in taxis plus another $ million in interest between and when oil company sue the state they want $ m plus $ m interest from be the state interest rate be much low than the one oil company set for we or the legislature be let they off with only year of interest the new law include the unbelievable provision that after three year the company will pay zero additional interest on delinquent taxis,owe $ m plus interest($ billion audits state collect interest correct $ million taxis plus $ million interest oil company sue state want $ m plus $ m interest state interest rate low oil company set legislature let year interest new law include unbelievable provision year company pay zero additional interest delinquent taxis
4,there is a reason there are no teeth to the law it is an unlawful law there is no way anyone can be forced to give someone else free electricity not yet at least you want to be green pay for it yourself like every body else must,there be a reason there be no tooth to the law it be an unlawful law there be no way anyone can be force to give someone else free electricity not yet at least you want to be green pay for it yourself like every body else must,reason tooth law unlawful law way force free electricity want green pay like body
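
The run above took a little over an hour for 360,301 comments. spaCy pipelines offer `nlp.pipe()`, which processes texts in batches and is often faster than calling `nlp()` once per row; whether and by how much it helps on this data has not been measured here. The sketch below assumes that approach. The function name `preprocess_batch` and the batch size are hypothetical, and the loaded pipeline is passed in as an argument.

```python
def preprocess_batch(texts, nlp, batch_size=1000):
    """Return one (clean_pp, clean_pp_lemma, clean_pp_lemma_stop) tuple per text."""
    rows = []
    for doc in nlp.pipe(texts, batch_size=batch_size):
        # drop punctuation once, then derive all three variants
        kept = [token for token in doc if not token.is_punct]
        rows.append((
            ' '.join(token.text.lower() for token in kept),
            ' '.join(token.lemma_.lower() for token in kept),
            ' '.join(token.lemma_.lower() for token in kept if not token.is_stop),
        ))
    return rows

# possible use in this notebook (assumes `nlp`, `corp` and `pd` from the cells above):
# corp_pp = pd.DataFrame(preprocess_batch(corp, nlp),
#                        columns=['clean_pp', 'clean_pp_lemma', 'clean_pp_lemma_stop'],
#                        index=corp.index)
```

Building the DataFrame once at the end also avoids creating one `pd.Series` per comment, which is the other likely cost of the `progress_apply` version.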


## Create new df with raw + cleaned + preprocessed comments + target

In [23]:
df_new = pd.concat([df['comment_text'],
                    corp,
                    corp_pp['clean_pp'],
                    corp_pp['clean_pp_lemma'],
                    corp_pp['clean_pp_lemma_stop'],
                    df['toxic']], axis=1)

# column names
df_new.columns = ['raw',
                  'clean',
                  'clean_pp',
                  'clean_pp_lemma',
                  'clean_pp_lemma_stop',
                  'toxic']

df_new.head()

Unnamed: 0,raw,clean,clean_pp,clean_pp_lemma,clean_pp_lemma_stop,toxic
0,"Well, what are the chances he will turn out to have been an active proponent of slavery?","Well, what are the chances he will turn out to have been an active proponent of slavery?",well what are the chances he will turn out to have been an active proponent of slavery,well what be the chance he will turn out to have be an active proponent of slavery,chance turn active proponent slavery,0
1,"The moment of critical mass is approaching when the deeds of Gupta & Co, like huge turbine engines slow down, halt and the reverse direction of the wheels of justice are set in motion leaving no hiding room.\n\n‘...unintended consequences’…. uneasy sleep ahead for many.","The moment of critical mass is approaching when the deeds of Gupta & Co, like huge turbine engines slow down, halt and the reverse direction of the wheels of justice are set in motion leaving no hiding room. ‘...unintended consequences’…. uneasy sleep ahead for many.",the moment of critical mass is approaching when the deeds of gupta co like huge turbine engines slow down halt and the reverse direction of the wheels of justice are set in motion leaving no hiding room unintended consequences uneasy sleep ahead for many,the moment of critical mass be approach when the deed of gupta co like huge turbine engine slow down halt and the reverse direction of the wheel of justice be set in motion leave no hiding room unintended consequence uneasy sleep ahead for many,moment critical mass approach deed gupta co like huge turbine engine slow halt reverse direction wheel justice set motion leave hiding room unintended consequence uneasy sleep ahead,0
2,"""Hey listen to me,"" he said. ""I'm not going to put up with your crap about all this."" He shouldn't have to prove himself to a reporter, he said.\n\nUh, actually Ben, you do. And you didn't. Buh-bye.","""Hey listen to me,"" he said. ""I'm not going to put up with your crap about all this."" He shouldn't have to prove himself to a reporter, he said. Uh, actually Ben, you do. And you didn't. Buh-bye.",hey listen to me he said i 'm not going to put up with your crap about all this he should n't have to prove himself to a reporter he said uh actually ben you do and you did n't buh bye,hey listen to i he say i be not go to put up with your crap about all this he should not have to prove himself to a reporter he say uh actually ben you do and you do not buh bye,hey listen say go crap prove reporter say uh actually ben buh bye,1
3,"We are already owed $488 M plus interest($2Billion) from 2006 audits the state has not collected.\nhttps://www.adn.com/energy/article/oil-audit-draft/2014/11/20/\n\nThis amount of interest doesn't seem correct...\n\n'$416 million in taxes, plus another $368 million in interest between 2007 and 2009'\n\nWhen oil companies sued the state they wanted $100 M plus $400 M interest from 2006.\nhttps://www.adn.com/business-economy/energy/2016/12/16/state-wins-case-against-oil-companies-worth-an-estimated-500-million/\n\nIs the state interest rate is much lower than the one oil companies set for us, or the legislature is letting them off with only 3 years of interest?\n\n ""The new law includes the unbelievable provision that after three years the companies will pay zero additional interest on delinquent taxes.""\nhttps://www.adn.com/opinions/2016/11/29/with-pfd-cut-on-the-line-oil-company-arguments-about-fine-points-of-tax-regs-will-backfire/","We are already owed $ M plus interest($ Billion) from audits the state has not collected. This amount of interest doesn't seem correct... '$ million in taxes, plus another $ million in interest between and ' When oil companies sued the state they wanted $ M plus $ M interest from . Is the state interest rate is much lower than the one oil companies set for us, or the legislature is letting them off with only years of interest? 
""The new law includes the unbelievable provision that after three years the companies will pay zero additional interest on delinquent taxes.""",we are already owed $ m plus interest($ billion from audits the state has not collected this amount of interest does n't seem correct $ million in taxes plus another $ million in interest between and when oil companies sued the state they wanted $ m plus $ m interest from is the state interest rate is much lower than the one oil companies set for us or the legislature is letting them off with only years of interest the new law includes the unbelievable provision that after three years the companies will pay zero additional interest on delinquent taxes,we be already owe $ m plus interest($ billion from audits the state have not collect this amount of interest do not seem correct $ million in taxis plus another $ million in interest between and when oil company sue the state they want $ m plus $ m interest from be the state interest rate be much low than the one oil company set for we or the legislature be let they off with only year of interest the new law include the unbelievable provision that after three year the company will pay zero additional interest on delinquent taxis,owe $ m plus interest($ billion audits state collect interest correct $ million taxis plus $ million interest oil company sue state want $ m plus $ m interest state interest rate low oil company set legislature let year interest new law include unbelievable provision year company pay zero additional interest delinquent taxis,0
4,"There is a reason there are no teeth to the law. It is an unlawful law. There is no way anyone can be forced to give someone else free electricity. Not yet at least.\n\nYou want to be green , pay for it yourself like every body else must.","There is a reason there are no teeth to the law. It is an unlawful law. There is no way anyone can be forced to give someone else free electricity. Not yet at least. You want to be green , pay for it yourself like every body else must.",there is a reason there are no teeth to the law it is an unlawful law there is no way anyone can be forced to give someone else free electricity not yet at least you want to be green pay for it yourself like every body else must,there be a reason there be no tooth to the law it be an unlawful law there be no way anyone can be force to give someone else free electricity not yet at least you want to be green pay for it yourself like every body else must,reason tooth law unlawful law way force free electricity want green pay like body,0


## Drop rows with NaN's

In [24]:
rows_before = df_new.shape[0]
print('Rows before dropping NaNs:', rows_before)
df_new.dropna(inplace=True)
df_new.reset_index(drop=True, inplace=True)
rows_after = df_new.shape[0]
print('Rows after:', rows_after)
print('Rows dropped:', rows_before - rows_after)

Rows before dropping NaNs: 360301
Rows after: 360301
Rows dropped: 0


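## Create fastText vectors

This step is currently disabled. The commented-out cells refer to a column `comment_clean_preproc`, which does not exist in `df_new` (its text columns are `raw`, `clean`, `clean_pp`, `clean_pp_lemma` and `clean_pp_lemma_stop`), so the column name has to be updated before the cells are re-enabled.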

In [25]:
# # create temp file for fastText
# df_new.comment_clean_preproc.to_csv('data/fasttext_training_data_tmp.csv',
#                                     index=False, header=False)

# # run unsupervised learning to get embeddings
# ft = fasttext.train_unsupervised('data/fasttext_training_data_tmp.csv')

# # delete temp file
# os.remove('data/fasttext_training_data_tmp.csv')

In [26]:
# # add fastText vectors to df
# df_new['ft_vector'] = df_new['comment_clean_preproc']\
#     .map(ft.get_sentence_vector)

In [27]:
df_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 360301 entries, 0 to 360300
Data columns (total 6 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   raw                  360301 non-null  object
 1   clean                360301 non-null  object
 2   clean_pp             360301 non-null  object
 3   clean_pp_lemma       360301 non-null  object
 4   clean_pp_lemma_stop  360301 non-null  object
 5   toxic                360301 non-null  int64 
dtypes: int64(1), object(5)
memory usage: 16.5+ MB


## Save CSV file

In [28]:
df_new.to_csv('data/data_usampl_60_40_FINAL.csv', index=False)

In [29]:
pd.read_csv('data/data_usampl_60_40_FINAL.csv')

Unnamed: 0,raw,clean,clean_pp,clean_pp_lemma,clean_pp_lemma_stop,toxic
0,"Well, what are the chances he will turn out to have been an active proponent of slavery?","Well, what are the chances he will turn out to have been an active proponent of slavery?",well what are the chances he will turn out to have been an active proponent of slavery,well what be the chance he will turn out to have be an active proponent of slavery,chance turn active proponent slavery,0
1,"The moment of critical mass is approaching when the deeds of Gupta & Co, like huge turbine engines slow down, halt and the reverse direction of the wheels of justice are set in motion leaving no hiding room.\n\n‘...unintended consequences’…. uneasy sleep ahead for many.","The moment of critical mass is approaching when the deeds of Gupta & Co, like huge turbine engines slow down, halt and the reverse direction of the wheels of justice are set in motion leaving no hiding room. ‘...unintended consequences’…. uneasy sleep ahead for many.",the moment of critical mass is approaching when the deeds of gupta co like huge turbine engines slow down halt and the reverse direction of the wheels of justice are set in motion leaving no hiding room unintended consequences uneasy sleep ahead for many,the moment of critical mass be approach when the deed of gupta co like huge turbine engine slow down halt and the reverse direction of the wheel of justice be set in motion leave no hiding room unintended consequence uneasy sleep ahead for many,moment critical mass approach deed gupta co like huge turbine engine slow halt reverse direction wheel justice set motion leave hiding room unintended consequence uneasy sleep ahead,0
2,"""Hey listen to me,"" he said. ""I'm not going to put up with your crap about all this."" He shouldn't have to prove himself to a reporter, he said.\n\nUh, actually Ben, you do. And you didn't. Buh-bye.","""Hey listen to me,"" he said. ""I'm not going to put up with your crap about all this."" He shouldn't have to prove himself to a reporter, he said. Uh, actually Ben, you do. And you didn't. Buh-bye.",hey listen to me he said i 'm not going to put up with your crap about all this he should n't have to prove himself to a reporter he said uh actually ben you do and you did n't buh bye,hey listen to i he say i be not go to put up with your crap about all this he should not have to prove himself to a reporter he say uh actually ben you do and you do not buh bye,hey listen say go crap prove reporter say uh actually ben buh bye,1
3,"We are already owed $488 M plus interest($2Billion) from 2006 audits the state has not collected.\nhttps://www.adn.com/energy/article/oil-audit-draft/2014/11/20/\n\nThis amount of interest doesn't seem correct...\n\n'$416 million in taxes, plus another $368 million in interest between 2007 and 2009'\n\nWhen oil companies sued the state they wanted $100 M plus $400 M interest from 2006.\nhttps://www.adn.com/business-economy/energy/2016/12/16/state-wins-case-against-oil-companies-worth-an-estimated-500-million/\n\nIs the state interest rate is much lower than the one oil companies set for us, or the legislature is letting them off with only 3 years of interest?\n\n ""The new law includes the unbelievable provision that after three years the companies will pay zero additional interest on delinquent taxes.""\nhttps://www.adn.com/opinions/2016/11/29/with-pfd-cut-on-the-line-oil-company-arguments-about-fine-points-of-tax-regs-will-backfire/","We are already owed $ M plus interest($ Billion) from audits the state has not collected. This amount of interest doesn't seem correct... '$ million in taxes, plus another $ million in interest between and ' When oil companies sued the state they wanted $ M plus $ M interest from . Is the state interest rate is much lower than the one oil companies set for us, or the legislature is letting them off with only years of interest? 
""The new law includes the unbelievable provision that after three years the companies will pay zero additional interest on delinquent taxes.""",we are already owed $ m plus interest($ billion from audits the state has not collected this amount of interest does n't seem correct $ million in taxes plus another $ million in interest between and when oil companies sued the state they wanted $ m plus $ m interest from is the state interest rate is much lower than the one oil companies set for us or the legislature is letting them off with only years of interest the new law includes the unbelievable provision that after three years the companies will pay zero additional interest on delinquent taxes,we be already owe $ m plus interest($ billion from audits the state have not collect this amount of interest do not seem correct $ million in taxis plus another $ million in interest between and when oil company sue the state they want $ m plus $ m interest from be the state interest rate be much low than the one oil company set for we or the legislature be let they off with only year of interest the new law include the unbelievable provision that after three year the company will pay zero additional interest on delinquent taxis,owe $ m plus interest($ billion audits state collect interest correct $ million taxis plus $ million interest oil company sue state want $ m plus $ m interest state interest rate low oil company set legislature let year interest new law include unbelievable provision year company pay zero additional interest delinquent taxis,0
4,"There is a reason there are no teeth to the law. It is an unlawful law. There is no way anyone can be forced to give someone else free electricity. Not yet at least.\n\nYou want to be green , pay for it yourself like every body else must.","There is a reason there are no teeth to the law. It is an unlawful law. There is no way anyone can be forced to give someone else free electricity. Not yet at least. You want to be green , pay for it yourself like every body else must.",there is a reason there are no teeth to the law it is an unlawful law there is no way anyone can be forced to give someone else free electricity not yet at least you want to be green pay for it yourself like every body else must,there be a reason there be no tooth to the law it be an unlawful law there be no way anyone can be force to give someone else free electricity not yet at least you want to be green pay for it yourself like every body else must,reason tooth law unlawful law way force free electricity want green pay like body,0
...,...,...,...,...,...,...
360296,Do you still beat your wife? Simple question.,Do you still beat your wife? Simple question.,do you still beat your wife simple question,do you still beat your wife simple question,beat wife simple question,0
360297,"The fascist dictator continues the insanity against all human and civil rights by the National Security State formerly the purview of Hitler, Stalin, Mao, Pol Pot, ad nauseum.","The fascist dictator continues the insanity against all human and civil rights by the National Security State formerly the purview of Hitler, Stalin, Mao, Pol Pot, ad nauseum.",the fascist dictator continues the insanity against all human and civil rights by the national security state formerly the purview of hitler stalin mao pol pot ad nauseum,the fascist dictator continue the insanity against all human and civil right by the national security state formerly the purview of hitler stalin mao pol pot ad nauseum,fascist dictator continue insanity human civil right national security state purview hitler stalin mao pol pot ad nauseum,1
360298,Sean Hannity is a lightweight foolish commentator on Fox News. He is in over his head in trying to act tough with the big boys.,Sean Hannity is a lightweight foolish commentator on Fox News. He is in over his head in trying to act tough with the big boys.,sean hannity is a lightweight foolish commentator on fox news he is in over his head in trying to act tough with the big boys,sean hannity be a lightweight foolish commentator on fox news he be in over his head in try to act tough with the big boy,sean hannity lightweight foolish commentator fox news head try act tough big boy,0
360299,There are a number of countries which make it impossible for their nationals to give up citizenship. Even if a new Cdn citizen wanted to give up their citizenship in their country of origin they may not be able to.,There are a number of countries which make it impossible for their nationals to give up citizenship. Even if a new Cdn citizen wanted to give up their citizenship in their country of origin they may not be able to.,there are a number of countries which make it impossible for their nationals to give up citizenship even if a new cdn citizen wanted to give up their citizenship in their country of origin they may not be able to,there be a number of country which make it impossible for their national to give up citizenship even if a new cdn citizen want to give up their citizenship in their country of origin they may not be able to,number country impossible national citizenship new cdn citizen want citizenship country origin able,0
