<a href="https://colab.research.google.com/github/Tstrebe2/predicting-text-difficulty/blob/dave-updates/code/dave_data_cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd

In [2]:
train_url = 'https://raw.githubusercontent.com/Tstrebe2/predicting-text-difficulty/main/assets/WikiLarge_Train.csv'

In [3]:
train_df = pd.read_csv(train_url)

In [5]:
train_df.sample(20)[['original_text', 
                     'label']].style.set_properties(subset=['original_text'], 
                                                    **{'width': '500px'})

Unnamed: 0,original_text,label
394408,"MNM made their debut in WWE on the April 14 , 2005 edition of SmackDown !",0
224848,Fingers and thumbs are types of digits .,0
245513,Origins,0
278946,Crest,0
85189,Knol is a Google project that aims to include user-written articles on a range of topics .,1
121957,"C-USA was founded in 1995 by the merger of the Metro Conference and Great Midwest Conference , two Division I conferences that did not sponsor football .",1
304572,"New talent such as Triple H and his D-Generation X faction , Mankind and The Rock were elevated to main event status on the WWF 's program .",0
39046,"He played in 24 seasons in the National Hockey League for the Toronto Maple Leafs , New York Rangers , Pittsburgh Penguins , and Buffalo Sabres .",1
209835,"Classical civilizations , notably the Persians , Macedonians , Nubians , Greeks , Parthians , Indians , Japanese , Chinese , and Koreans , fielded large numbers of archers in their armies .",0
166947,"The MiG-29 , along with the Sukhoi Su-27 , were developed to counter new American fighters such as the McDonnell Douglas F-15 Eagle , and the General Dynamics F-16 Fighting Falcon .",1


In [6]:
# Things that need cleaning:
# - Rejoin " 's " with its word; same for split contractions (e.g. "could n't") - done
# - Some sentences are fragments of other sentences in the corpus
# - Accents - not sure yet whether these need handling
# - Punctuation (for the vectorizers) - done through regex
# - Odd quotation marks, e.g. at index 11005: '' yogurt ' ''
# - -LRB- and -RRB-, which are tokenizer placeholders (Penn Treebank style)
#   for left and right parentheses - done
# - Stray 'â' characters - this is mis-decoded text, not content; use the ftfy package
#
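On the open question about accents, here is a minimal sketch of what stripping them would look like. It uses the standard-library `unicodedata` module as a stand-in for `unidecode`, so it is an alternative technique rather than what this notebook runs, and `strip_accents` is a hypothetical helper name.

```python
import unicodedata

def strip_accents(s):
    """Remove combining marks so accented letters fall back to their base letter."""
    # NFKD splits an accented letter into base letter + combining mark(s);
    # it also maps a non-breaking space (\xa0) to a plain space
    decomposed = unicodedata.normalize("NFKD", s)
    # drop the combining marks and keep everything else
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Dvořák visited São Paulo"))
```

Whether this helps is a modelling decision: accented names may themselves be a signal of difficulty, so it may be worth comparing results with and without this step.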


# Observations
# Difficulty can come from a combination of hard words, hard-to-pronounce or
# unfamiliar names, long run-on sentences, harder topics (e.g. the Linux kernel
# or dog breeds), or nonsensical sentences that lack context
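Some of the observations above (long sentences, hard words) can be approximated with simple surface counts. The sketch below is only an illustration under stated assumptions: `surface_features` is a hypothetical helper, the 7-character cutoff for a "long" word is an arbitrary choice, and none of this is part of the cleaning pipeline in this notebook.

```python
def surface_features(sentence):
    """Rough difficulty proxies for one whitespace-tokenized sentence."""
    tokens = sentence.split()
    # keep only tokens that contain at least one letter, so ',' and '.' are skipped
    words = [t for t in tokens if any(c.isalpha() for c in t)]
    n_words = len(words)
    if n_words == 0:
        return {"n_words": 0, "mean_word_len": 0.0, "long_word_share": 0.0}
    mean_len = sum(len(w) for w in words) / n_words
    # share of words with 7 or more characters (assumed threshold)
    long_share = sum(len(w) >= 7 for w in words) / n_words
    return {"n_words": n_words, "mean_word_len": mean_len, "long_word_share": long_share}

print(surface_features("Fingers and thumbs are types of digits ."))
```

Features like these would not capture unfamiliar names or missing context, which is why the observations list several sources of difficulty.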




In [7]:
# The contractions package expands contractions (e.g. "couldn't" -> "could not")
!pip install contractions
!pip install gensim
!pip install ftfy
!pip install unidecode

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting contractions
  Downloading contractions-0.1.72-py2.py3-none-any.whl (8.3 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting anyascii
  Downloading anyascii-0.3.1-py3-none-any.whl (287 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-1.4.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (106 kB)
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.1 contractions-0.1.72 pyahocorasick-1.4.4 textsearch-0.0.24
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting 

In [8]:
import re
import contractions
from gensim.utils import simple_preprocess
import ftfy
from unidecode import unidecode

def text_processing(s):
  """Clean one sentence of the corpus and return it as a string."""

  # contraction suffixes that were split off from their words
  contract_lst = ["'ve", "'ll", "'d", "n't"]

  # contractions were separated by a space; this reconnects them
  # to the word (e.g. "could n't" -> "couldn't")
  for suffix in contract_lst:
    s = re.sub(r'\s' + suffix, suffix, s)

  # expand contractions (e.g. "couldn't" -> "could not")
  s = contractions.fix(s)

  # remove empty quotes ('')
  s = re.sub(r"''", '', s)

  # remove the possessive " 's"
  s = re.sub(r" 's", '', s)

  # remove -LRB- and -RRB- (placeholders for parentheses)
  pattern = r'(-LRB-|-RRB-)+'
  s = re.sub(pattern, '', s)

  # remove ndash
  s = re.sub(r'\sndash\s', '', s)

  # fix issues with incorrect encoding
  s = ftfy.fix_text(s)

  # remove punctuation and symbols
  s = re.sub(r'[$,.!?;:%@&/\\]+', '', s)

  # remove numbers
  s = re.sub(r'[0-9]+', '', s)

  # positive lookbehind: drop a dash that is preceded by a space or
  # another dash (so it is not part of a hyphenated word or name)
  s = re.sub(r'(?<=[ -])-', '', s)

  return s
  # return simple_preprocess(s)
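The substitutions in `text_processing` delete characters without tidying the spaces around them, so the result can contain double or trailing spaces. A possible follow-up step is sketched below; `collapse_whitespace` is a hypothetical helper that is not part of the function above.

```python
import re

def collapse_whitespace(s):
    """Squeeze each run of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", s).strip()

sample = "The appellation Driekoningen  the Three Kings  is also often found "
print(collapse_whitespace(sample))
```

If the text is later split on whitespace by a vectorizer, the extra spaces are harmless and this step is optional. It matters mainly for exact string matching, such as `str.contains` lookups.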



In [12]:

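# text = 'SOS -LRB- Â Â Â â '' â '' â '' Â Â Â -RRB- is a Morse code . It is used as distress code , to signal danger .'
# Assumed input: the original assignment is not shown in this cell, so this
# sentence is reconstructed from the output below
text = "I 've tried to do something but could n't"
fixed = text_processing(text)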
fixed

'I have tried to do something but could not'

In [None]:
# %%timeit
train_df['processed_text'] = train_df['original_text'].apply(text_processing)

In [None]:
train_df[['original_text','processed_text']].sample(20).style.set_properties(subset=['original_text'], 
                                                    **{'width': '500px'})

Unnamed: 0,original_text,processed_text
190982,Salts that hydrolyze to produce hydroxide ions when dissolved in water are basic salts and salts that hydrolyze to produce hydronium ions in water are acid salts .,Salts that hydrolyze to produce hydroxide ions when dissolved in water are basic salts and salts that hydrolyze to produce hydronium ions in water are acid salts
318305,"White dwarfs are not very bright because they are smaller than many brighter stars - not because they are cold . Some white dwarfs are blue , instead of white .",White dwarfs are not very bright because they are smaller than many brighter stars not because they are cold Some white dwarfs are blue instead of white
243421,The song of the black-capped chickadee is a clear whistle .,The song of the black-capped chickadee is a clear whistle
320045,The length of time it took to record the song created tension between the Beatles .,The length of time it took to record the song created tension between the Beatles
24647,"In Duck Hunt , players utilize the Nintendo Zapper Light Gun that must be plugged into their NES consoles , and attempt to shoot down either ducks or clay pigeons in mid-flight .",In Duck Hunt players utilize the Nintendo Zapper Light Gun that must be plugged into their NES consoles and attempt to shoot down either ducks or clay pigeons in mid-flight
221455,in computer games it is called a combo .,in computer games it is called a combo
387985,"Also during 1992 , shortly after releasing this , Rancid were signed to Bad Religion guitarist Brett Gurewitz 's label , Epitaph Records , and finally released their first album in 1993 , which is also self-titled .",Also during shortly after releasing this Rancid were signed to Bad Religion guitarist Brett Gurewitz label Epitaph Records and finally released their first album in which is also self-titled
105084,"Rage Against the Machine is noted for its innovative blend of alternative rock , punk rock , rap , heavy metal and funk as well as its revolutionary politics and lyrics .",Rage Against the Machine is noted for its innovative blend of alternative rock punk rock rap heavy metal and funk as well as its revolutionary politics and lyrics
377155,Today the flowers are national symbols of Korea .,Today the flowers are national symbols of Korea
15128,"Currently , EA 's most successful products are sports games published under its EA Sports label , games based on popular movie licenses such as Harry Potter and games from long-running franchises like Need for Speed , Medal of Honor , The Sims , Battlefield and the later games in the Burnout and Command & Conquer series .",Currently EA most successful products are sports games published under its EA Sports label games based on popular movie licenses such as Harry Potter and games from long-running franchises like Need for Speed Medal of Honor The Sims Battlefield and the later games in the Burnout and Command Conquer series


In [None]:
train_df[train_df['processed_text'].str.contains('The ball is about 3\xa01/4 inches  83 centimetres  in diameter and about four ounces  1134 grams')]

Unnamed: 0,original_text,label,formatted_text,processed_text
304503,The ball is about 3 1/4 inches -LRB- 8.3 centi...,0,The ball is about 3 1/4 inches 8.3 centimetre...,The ball is about 3 1/4 inches 83 centimetres...


In [None]:
train_df.iloc[110574][['original_text','processed_text']].values

array(["The appellation Driekoningen -LRB- the Three Kings -RRB- is also often found in 17th - and 18th-century Dutch star charts and seaman 's guides .",
       'The appellation Driekoningen  the Three Kings  is also often found in th  and th-century Dutch star charts and seaman guides '],
      dtype=object)
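The last example shows a side effect of deleting digits: "17th" and "18th-century" become "th" and "th-century". One alternative, sketched here as an assumption rather than as this notebook's method, is to replace each number (with any ordinal suffix) by a placeholder token. `replace_numbers` and the `NUM` token are hypothetical names.

```python
import re

# digits, an optional decimal part, and an optional ordinal suffix
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?(?:st|nd|rd|th)?\b")

def replace_numbers(s, token="NUM"):
    """Replace numbers such as '17th', '1995' or '8.3' with a placeholder token."""
    return NUMBER_PATTERN.sub(token, s)

print(replace_numbers("found in 17th - and 18th-century Dutch star charts"))
```

This would need to run before the punctuation step, because that step strips the decimal point from values like 8.3.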