<a href="https://colab.research.google.com/github/IagoGarciaSuarez/MachineLearningNLP/blob/main/ML_Task_3_Preprocess.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Getting ready

In [None]:
!rm -f *.csv
!wget https://raw.githubusercontent.com/IagoGarciaSuarez/MachineLearningNLP/main/comments.csv
!pip install demoji
!pip install langdetect
!pip install textblob

In [None]:
import pandas as pd
import nltk
nltk.download('wordnet')
nltk.download("popular")
nltk.download('vader_lexicon')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
import demoji
from langdetect import detect
from textblob import TextBlob
import re
from itertools import groupby


In [None]:
comments = pd.read_csv("comments.csv")
comments

Unnamed: 0,class,text
0,Auto,I have recently purchased a J30T with moderat...
1,Camera,I bought this product because I need instant ...
2,Auto,I have owned my Buick since 53000 km and I am...
3,Camera,This was my first Digital camera so I did qui...
4,Camera,Minolta DiMAGE 7Hi is in a digital SLR with 5...
...,...,...
595,Auto,Recently our 12 year old Nissan Stanza decide...
596,Camera,I always do a lot of research before I buy an...
597,Auto,This car is an all around good buy If you ar...
598,Auto,I waited to write this until I have had 4 mon...


# 2. Preprocessing

During preprocessing, special characters are removed, every word is lowercased, and the text is lemmatized once the preparatory steps that lemmatization requires are done.

The goal is a target dataset containing the class column, the original comment (in case it is needed), and the preprocessed text.

## 2.1. Removing special characters.

Numbers and non-alphanumeric characters are not needed, so each run of them is replaced with a single space.

In [None]:
comments['transf_text'] = comments['text'].apply(lambda com: re.sub("(\\d|\\W)+"," ",com))
comments

Unnamed: 0,class,text,transf_text
0,Auto,I have recently purchased a J30T with moderat...,I have recently purchased a J T with moderate...
1,Camera,I bought this product because I need instant ...,I bought this product because I need instant ...
2,Auto,I have owned my Buick since 53000 km and I am...,I have owned my Buick since km and I am now a...
3,Camera,This was my first Digital camera so I did qui...,This was my first Digital camera so I did qui...
4,Camera,Minolta DiMAGE 7Hi is in a digital SLR with 5...,Minolta DiMAGE Hi is in a digital SLR with me...
...,...,...,...
595,Auto,Recently our 12 year old Nissan Stanza decide...,Recently our year old Nissan Stanza decided i...
596,Camera,I always do a lot of research before I buy an...,I always do a lot of research before I buy an...
597,Auto,This car is an all around good buy If you ar...,This car is an all around good buy If you are...
598,Auto,I waited to write this until I have had 4 mon...,I waited to write this until I have had month...
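
To see what the substitution does to a single string, the following minimal sketch applies the same pattern to a made-up sentence. The sample text is hypothetical and not taken from the dataset.

```python
import re

# Hypothetical sample sentence (not from comments.csv).
sample = "I bought a J30T for $5,000!"

# Same pattern as above: any run of digits and/or non-word characters
# collapses into one space.
cleaned = re.sub(r"(\d|\W)+", " ", sample)
print(repr(cleaned))
```

Model names such as `J30T` are split into separate letters, which matches the `J T` seen in the first row of the output above.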


## 2.2. To lower

The transformed text from the previous step is converted to lowercase. The call below lowercases every column, so the class labels and the original text are lowercased as well.

In [None]:
comments = comments.apply(lambda x: x.str.lower())
comments

Unnamed: 0,class,text,transf_text
0,auto,i have recently purchased a j30t with moderat...,i have recently purchased a j t with moderate...
1,camera,i bought this product because i need instant ...,i bought this product because i need instant ...
2,auto,i have owned my buick since 53000 km and i am...,i have owned my buick since km and i am now a...
3,camera,this was my first digital camera so i did qui...,this was my first digital camera so i did qui...
4,camera,minolta dimage 7hi is in a digital slr with 5...,minolta dimage hi is in a digital slr with me...
...,...,...,...
595,auto,recently our 12 year old nissan stanza decide...,recently our year old nissan stanza decided i...
596,camera,i always do a lot of research before i buy an...,i always do a lot of research before i buy an...
597,auto,this car is an all around good buy if you ar...,this car is an all around good buy if you are...
598,auto,i waited to write this until i have had 4 mon...,i waited to write this until i have had month...


## 2.3. Lemmatizing

Before lemmatizing, the text needs further preprocessing: emoticons are translated, wrong words are corrected, contractions are expanded, and repeated terms are removed.
The steps are carried out in this order:


1.   Emoticon translation.
2.   Wrong words correction.
3.   Contractions removal.
4.   Repeated words removal.



### 2.3.1. Emoticon translation
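No emoticons are found in any comment, so nothing needs to be translated. This is the likely outcome in any case, since step 2.1 already replaced every non-alphanumeric character with a space. One-letter words can now be safely removed as well.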

In [None]:
[demoji.findall(com) for com in comments['transf_text'] if demoji.findall(com)]

[]

In [None]:
comments['transf_text'] = comments['transf_text'].apply(lambda com: re.sub('\\b\\w\\b', '', com))
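
The effect of the single-letter removal can be checked on one string. This is a minimal sketch with a hypothetical sentence. The whitespace normalization at the end is an addition for readability and is not part of the cell above.

```python
import re

# Hypothetical text as it might look after step 2.1 turned "can't" into "can t".
text = "i can t find a j t manual"

# Remove every single-character word, as in the cell above.
stripped = re.sub(r"\b\w\b", "", text)

# The removals leave runs of spaces behind; split/join collapses them.
normalized = " ".join(stripped.split())
print(normalized)
```

The `t` left over from a contraction is removed along with the other single letters, which is relevant for the contraction patterns applied later.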

### 2.3.2. Wrong words correction
Spelling correction depends on the language, and the comments could be written in more than one. To check this, the language of each comment is detected and the distinct languages are displayed.

Only English is detected, so it is safe to proceed with the correction. The corrected comments are stored in the same field. As the output shows, the corrector also rewrites brand and model names it does not recognise: `minolta dimage`, for example, becomes `minorca damage`.

Contractions are then expanded using regular expressions.

The last step, once the text is corrected and free of contractions, is to remove repeated words. It runs last so that no sequences such as "don't not" remain, and to avoid having to check emoticons, had there been any.

In [None]:
languages = [detect(com) for com in comments['transf_text']]
# Distinct detected languages, in order of first appearance
dif_langs = list(dict.fromkeys(languages))

dif_langs

['en']
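
If more than one language did turn up, it would help to know how many comments are affected and which ones. The following is a minimal sketch of that check. The `detected` list is a hypothetical stand-in for the `detect` results, not data from this notebook.

```python
from collections import Counter

# Hypothetical language codes, one per comment (stand-in for detect() output).
detected = ["en", "en", "es", "en"]

# How many comments were assigned to each language.
counts = Counter(detected)

# Positions of the comments that are not English and would need separate handling.
non_english = [i for i, lang in enumerate(detected) if lang != "en"]

print(counts.most_common())
print(non_english)
```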

In [None]:
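# TextBlob.correct() returns a TextBlob object rather than a str, which is why
# the column is displayed below as a sequence of characters. It is converted
# back to str in the contractions step.
comments['transf_text'] = comments['transf_text'].apply(lambda com: TextBlob(com).correct())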
comments

Unnamed: 0,class,text,transf_text
0,auto,i have recently purchased a j30t with moderat...,"( , , h, a, v, e, , r, e, c, e, n, t, l, y, ..."
1,camera,i bought this product because i need instant ...,"( , , b, o, u, g, h, t, , t, h, i, s, , p, ..."
2,auto,i have owned my buick since 53000 km and i am...,"( , , h, a, v, e, , o, w, n, e, d, , m, y, ..."
3,camera,this was my first digital camera so i did qui...,"( , t, h, i, s, , w, a, s, , m, y, , f, i, ..."
4,camera,minolta dimage 7hi is in a digital slr with 5...,"( , m, i, n, o, r, c, a, , d, a, m, a, g, e, ..."
...,...,...,...
595,auto,recently our 12 year old nissan stanza decide...,"( , r, e, c, e, n, t, l, y, , o, u, r, , y, ..."
596,camera,i always do a lot of research before i buy an...,"( , , a, l, w, a, y, s, , d, o, , , l, o, ..."
597,auto,this car is an all around good buy if you ar...,"( , t, h, i, s, , c, a, r, , i, s, , a, n, ..."
598,auto,i waited to write this until i have had 4 mon...,"( , , w, a, i, t, e, d, , t, o, , w, r, i, ..."


### 2.3.3. Contractions removal

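Contractions are expanded with regular expressions. Because step 2.1 replaced apostrophes with spaces, the patterns also cover the space-separated form (for example, `you ll`). Note that the single-letter removal in 2.3.1 has already stripped a trailing `t`, `s`, `d` or `m`, so in practice mainly the `ll`, `ve` and `re` forms can still match at this point.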

In [None]:
# Each pattern accepts an apostrophe or the space that replaced it in step 2.1.
# A single capture group keeps \g<1> valid whichever form matches.
patterns = [
(r"\bcan(?:'| )t ", 'cannot '),           # can't
(r"\bwon(?:'| )t ", 'will not '),         # won't
(r"\bi'm ", 'i am '),                     # I'm
(r"\bain(?:'| )t ", 'is not '),           # ain't
(r"(\w+)(?:'| )ll ", r'\g<1> will '),     # I'll, you'll, they'll
(r"(\w+)n(?:'| )t ", r'\g<1> not '),      # isn't, don't
(r"(\w+)(?:'| )ve ", r'\g<1> have '),     # you've, I've
(r"(\w+)(?:'| )s ", r'\g<1> is '),        # he's, she's
(r"(\w+)(?:'| )re ", r'\g<1> are '),      # you're, they're
(r"(\w+)(?:'| )d ", r'\g<1> would '),     # I'd, you'd, they'd
]

patterns = [(re.compile(regex), non_contr) for (regex, non_contr) in patterns]

def rem_contractions(com):
  # Apply every pattern in turn, feeding each result into the next substitution.
  for (pattern, non_contr) in patterns:
    com = pattern.sub(non_contr, com)
  return com

comments['transf_text'] = comments['transf_text'].apply(lambda com: rem_contractions(str(com)))
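
A detail worth keeping in mind with patterns of this kind: when two alternatives each carry their own capture group, `\g<1>` refers only to the first one, and it is replaced by an empty string whenever the second alternative is the one that matches. The sketch below compares both forms on a hypothetical sentence. The names `two_groups` and `one_group` are illustrative only.

```python
import re

# Two alternatives, each with its own capture group.
two_groups = re.compile(r"(\w+)'ll |(\w+) ll ")
# One capture group shared by both forms of the contraction.
one_group = re.compile(r"(\w+)(?:'| )ll ")

# Hypothetical text in which the apostrophe has already become a space.
text = "you ll like it "

# Group 1 did not take part in the match, so the pronoun is lost.
lost = two_groups.sub(r"\g<1> will ", text)
# The shared group always holds the word before the contraction.
kept = one_group.sub(r"\g<1> will ", text)

print(repr(lost))
print(repr(kept))
```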

### 2.3.4. Repeated words removal
In this step each comment is tokenized and stop words are removed. Grouping the tokens with `groupby` also collapses consecutive repeated words, so they are not added to the token list more than once.

In [None]:
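# Build the stop word set once instead of once per token
stop_words = set(stopwords.words("english"))

def w_tokenize(com):
  tokens = word_tokenize(com)
  # groupby collapses runs of identical consecutive tokens; stop words are then dropped
  return [tok for tok, _ in groupby(tokens) if tok not in stop_words]

comments['transf_text'] = comments['transf_text'].apply(w_tokenize)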
comments

Unnamed: 0,class,text,transf_text
0,auto,i have recently purchased a j30t with moderat...,"[recently, purchased, moderate, miles, stopped..."
1,camera,i bought this product because i need instant ...,"[bought, product, need, instant, gratification..."
2,auto,i have owned my buick since 53000 km and i am...,"[owned, quick, since, km, approaching, must, s..."
3,camera,this was my first digital camera so i did qui...,"[first, digital, camera, quite, bit, research,..."
4,camera,minolta dimage 7hi is in a digital slr with 5...,"[minorca, damage, hi, digital, sir, megapixel,..."
...,...,...,...
595,auto,recently our 12 year old nissan stanza decide...,"[recently, year, old, nissan, stanza, decided,..."
596,camera,i always do a lot of research before i buy an...,"[always, lot, research, buy, anything, anymore..."
597,auto,this car is an all around good buy if you ar...,"[car, around, good, buy, cars, really, get, lo..."
598,auto,i waited to write this until i have had 4 mon...,"[waited, write, months, driving, kit, shortage..."
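
The behaviour of `groupby` in this step can be illustrated without NLTK. This is a minimal sketch under stated assumptions: `STOP_WORDS` is a tiny hypothetical stand-in for the NLTK English stop word list, and `str.split` replaces `word_tokenize`.

```python
from itertools import groupby

# Hypothetical, deliberately tiny stop word set (the notebook uses NLTK's list).
STOP_WORDS = frozenset({"the", "is", "a"})

def dedupe_and_filter(tokens):
    # groupby yields one key per run of equal consecutive tokens,
    # so "very very" contributes a single "very".
    return [tok for tok, _ in groupby(tokens) if tok not in STOP_WORDS]

tokens = "the car is very very good and the car is fast".split()
result = dedupe_and_filter(tokens)
print(result)
```

Only consecutive repetitions are collapsed: `car` appears twice in the result because its two occurrences are not adjacent.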


The tokens can now be lemmatized with less computation time, and the result is stored in the transformed text field. Tokens are lemmatized as verbs (`pos='v'`), so noun plurals such as `miles` are left unchanged. The lemmatizer output is a list, but plain text is more useful, so the list is joined into a string and then saved to a file to avoid running the preprocessing again, given its long completion time.

In [None]:
lemmatizer = WordNetLemmatizer()

def lem_tokens(tok_list):
  # pos='v' treats every token as a verb
  return [lemmatizer.lemmatize(t, pos='v') for t in tok_list]

comments['transf_text'] = comments['transf_text'].apply(lem_tokens)

comments

Unnamed: 0,class,text,transf_text
0,auto,i have recently purchased a j30t with moderat...,"[recently, purchase, moderate, miles, stop, ca..."
1,camera,i bought this product because i need instant ...,"[buy, product, need, instant, gratification, s..."
2,auto,i have owned my buick since 53000 km and i am...,"[own, quick, since, km, approach, must, say, n..."
3,camera,this was my first digital camera so i did qui...,"[first, digital, camera, quite, bite, research..."
4,camera,minolta dimage 7hi is in a digital slr with 5...,"[minorca, damage, hi, digital, sir, megapixel,..."
...,...,...,...
595,auto,recently our 12 year old nissan stanza decide...,"[recently, year, old, nissan, stanza, decide, ..."
596,camera,i always do a lot of research before i buy an...,"[always, lot, research, buy, anything, anymore..."
597,auto,this car is an all around good buy if you ar...,"[car, around, good, buy, cars, really, get, lo..."
598,auto,i waited to write this until i have had 4 mon...,"[wait, write, months, drive, kit, shortage, wi..."


In [None]:
comments['transf_text'] = comments['transf_text'].apply(lambda c: ' '.join(c))
comments

Unnamed: 0,class,text,transf_text
0,auto,i have recently purchased a j30t with moderat...,recently purchase moderate miles stop car look...
1,camera,i bought this product because i need instant ...,buy product need instant gratification stand t...
2,auto,i have owned my buick since 53000 km and i am...,own quick since km approach must say nicest ca...
3,camera,this was my first digital camera so i did qui...,first digital camera quite bite research unfor...
4,camera,minolta dimage 7hi is in a digital slr with 5...,minorca damage hi digital sir megapixel cod se...
...,...,...,...
595,auto,recently our 12 year old nissan stanza decide...,recently year old nissan stanza decide time re...
596,camera,i always do a lot of research before i buy an...,always lot research buy anything anymore talk ...
597,auto,this car is an all around good buy if you ar...,car around good buy cars really get lot extra ...
598,auto,i waited to write this until i have had 4 mon...,wait write months drive kit shortage wife hand...


In [None]:
comments.to_csv('comments_targetdata.csv')
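
The `Unnamed: 0` column in the tables above is what pandas produces when a CSV that was written with its index is read back without declaring an index column. The following minimal sketch reproduces this on a small hypothetical frame. Passing `index=False` is shown as one possible way to avoid the extra column, not as what this notebook does.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the preprocessed comments.
df = pd.DataFrame({
    "class": ["auto", "camera"],
    "transf_text": ["own car", "buy camera"],
})

# to_csv() without a path returns the CSV text; the index is written by default.
with_index = pd.read_csv(io.StringIO(df.to_csv()))
# index=False leaves the index out of the file.
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```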