# Session 1 - Data Cleaning and Normalization

In this session, we will learn the basics of data cleaning and normalization. You will discover several techniques for turning raw data into meaningful information.

Keywords: *tokenization*, *stop words*, *regex*

## Overview of the project and the data
This stage of the project focuses on building a classification model that assigns a company to its industry, using the company description scraped online.

We will explore how to represent this information so that a machine can process it efficiently, and look for the method that gives the best accuracy. During this stage, you will develop an independent pipeline that takes text as input and returns the industry closest to that input.

To present the different techniques, we will use a famous dataset called 20 Newsgroups. Your role will be to apply all the techniques learned along the way to the Fama industry data.

Let's first look at the dataset that we will use throughout this course, with the help of a famous library called [Pandas](https://www.educba.com/what-is-pandas/):

In [None]:
# Let's import the libraries
import pandas as pd  # We define an alias for future usage of the library
import random
import regex as re  # Third-party `regex` package, largely compatible with the standard `re` module
import unicodedata
import nltk
import spacy
import string

In [None]:
# We load the dataset from scikit-learn; it is wrapped in a pandas DataFrame in the next cell
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups()

In [None]:
dataset = pd.DataFrame({"text": data["data"], "label": data["target"]})

In [None]:
# Let's now look at a sample of the data
dataset.head()

| | text | label |
|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 |


There are **2** features in this dataset: the text of the news article and the category of that news article.

Let's pick a text at random and see how we can clean it:


In [None]:
sentences = dataset["text"].values

In [None]:
sentence = random.choice(sentences)
print(sentence)

From: tedr@athena.cs.uga.edu (Ted Kalivoda)
Subject: Re: Atheist's views on Christianity (was: Re: "Accepting Jeesus in your heart...")
Organization: University of Georgia - UCNS
Lines: 32

In article <Apr.19.05.13.48.1993.29266@athos.rutgers.edu>,
kempmp@phoenix.oulu.fi (Petri Pihko) wrote:
> 
> Jason Smith (jasons@atlastele.com) wrote: 
 
> Another answer is that God is the _source_ of all existence.
> This sounds much better, but I am tempted to ask: Does God
> Himself exist, then? If God is the source of His own existence,
> it can only mean that He has, in terms of human time, always
> existed. But this is not the same as the source of all existence.
> This argument sounds like God does not exist, but meta-exists,
> and from His meta-existent perspective, He created existence.
> I think this is actually a nonsolution, a mere twist of words.

Always existing and being the source of the existence of all other beings
is not problematic.

But, as you put, Being the source of "all" exi

[10 min] Let's list some information that might not be relevant for understanding the category of the news article:

- Email addresses
- Common words (stop words)
- Empty spaces
- Punctuation
- Phone numbers
- Dates

# Remove unwanted information

To remove this irrelevant information, we can use several techniques that we will cover in this course.

## Regex 
We can filter and clean the text using regular expressions. A regular expression (regex for short) is a string that defines a pattern used to match, locate, and manage text.

Here is an example:

![Regular expression](https://www.computerhope.com/jargon/r/regular-expression.gif)

[10 min] Let's work together on building and understanding the email regex:

In [None]:
pattern = r"[\w.-]+@[\w.]+\.[a-zA-Z]{2,4}"  # raw string, so the backslashes reach the regex engine untouched
example = "We want to extract this email: antoine-121.gargot@hawk.iit.edu"
# re.findall(pattern, example)


In [None]:
email_regex = re.compile(pattern)
sentence = email_regex.sub(' ', sentence)
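
To see what each part of the email pattern does, the same expression can be written in verbose mode with one commented piece per line. This is a minimal sketch using the standard `re` module; the name `EMAIL_RE` is ours, and the pattern is a teaching approximation rather than a full email validator.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each piece can carry a comment.
EMAIL_RE = re.compile(
    r"""
    [\w.-]+        # local part: letters, digits, underscore, dot or hyphen
    @              # the mandatory separator
    [\w.]+         # domain labels, e.g. "hawk.iit"
    \.             # the dot before the top-level domain
    [a-zA-Z]{2,4}  # top-level domain of 2 to 4 letters, e.g. "edu"
    """,
    re.VERBOSE,
)

example = "We want to extract this email: antoine-121.gargot@hawk.iit.edu"
matches = EMAIL_RE.findall(example)
print(matches)

# Caveat: a longer top-level domain is only partially matched.
partial = EMAIL_RE.findall("contact x@example.museum")
print(partial)
```

The second call shows a limit of the `{2,4}` quantifier: only part of the address is matched, so a substitution would leave a fragment behind. Widening the upper bound is one possible adjustment if your data contains such domains.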

In our case, we also don't want to keep numbers or special characters: they can appear in any news article and don't seem to be a strong indicator of the category. We can apply a regex-based function to the dataset to remove them.

[Regex Cheatsheet](https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285)
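
As one possible way to drop numbers and special characters, the sketch below replaces every character that is not an ASCII letter or whitespace with a space. The helper name `keep_letters_only` and the sample strings are hypothetical, and the character class is an assumption: it would also strip accented letters, so it only suits text that has already been reduced to plain ASCII.

```python
import re

# Anything that is not an ASCII letter or whitespace is replaced by a space.
NON_LETTER_RE = re.compile(r"[^a-zA-Z\s]")

def keep_letters_only(text):
    """Replace digits and special characters with spaces (hypothetical helper)."""
    return NON_LETTER_RE.sub(" ", text)

# A tiny stand-in for the `text` column of the dataset.
texts = ["Lines: 32, call 555!", "Subject: Re: W..."]
cleaned = [keep_letters_only(t) for t in texts]
print(cleaned)

# With the DataFrame built earlier, the same function could be applied column-wise:
# dataset["text"] = dataset["text"].apply(keep_letters_only)
```

Replacing with a space rather than an empty string keeps neighbouring words from being glued together; the extra spaces can be collapsed later.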

Regex is useful for removing:
- HTML
- URLs
- Unicode (emoji 🤩) [Emojis table](http://www.unicode.org/emoji/charts/full-emoji-list.html)
- Hashtags
- Phone numbers

In [None]:
def remove_html(text):
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r'', text)

example = """
<div>This is a test</div>
"""
sentence = remove_html(sentence)

def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

sentence = remove_urls(sentence)

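def remove_emoji(text):
    # The parameter is named `text` to avoid shadowing the `string` module imported above
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs (e.g. 🤓)
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # enclosed characters; very broad, also matches CJK text
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

example = "This is a test 😏 🙃 🤓"
sentence = remove_emoji(sentence)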
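
The list above also mentions hashtags and phone numbers, which the previous cell does not cover. Here is a minimal sketch for both, using the standard `re` module. The function names are hypothetical, and the phone pattern is an assumption that only targets North American style numbers such as `(312) 555-0199`; it would also remove any other run of ten digits in that shape.

```python
import re

# A hashtag is a '#' followed by one or more word characters.
HASHTAG_RE = re.compile(r"#\w+")

# Assumed format: optional country code, 3-digit area code (with or without
# parentheses), then 3 + 4 digits, separated by spaces, dots or hyphens.
PHONE_RE = re.compile(
    r"(?:\+?\d{1,3}[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}"
)

def remove_hashtags(text):
    return HASHTAG_RE.sub("", text)

def remove_phone_numbers(text):
    return PHONE_RE.sub("", text)

example = "Call (312) 555-0199 or 312.555.0199 today #sale #Chicago"
no_tags = remove_hashtags(example)
no_phones = remove_phone_numbers(example)
print(no_tags)
print(no_phones)
```

Phone formats vary a lot between countries, so in practice the pattern should be adapted to the numbers that actually occur in your dataset.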



In [None]:
"This is a test 🤓, 🙃".encode("unicode_escape")

b'This is a test \\U0001f913, \\U0001f643'

## String operations

String operations are useful for removing or changing:

- Numbers
- Punctuation
- Letter case (lowercasing)

In [None]:
PUNCT_TO_REMOVE = string.punctuation
def remove_punctuation(text):
    """custom function to remove the punctuation"""
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

In [None]:
sentence = sentence.lower()
sentence = remove_punctuation(sentence)

In [None]:
# For numbers: we use a separate variable so that `sentence` is not overwritten
example = "this is a test 1231341234123"
''.join([i for i in example if not i.isdigit()])

'this is a test '
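
The string operations above can be combined into a single pass, and the same step can take care of the empty spaces listed earlier. This is a minimal sketch; `basic_clean` is a hypothetical helper, and collapsing whitespace also flattens line breaks, which is an assumption that may not suit every dataset.

```python
import string

# One translation table that deletes both punctuation and digits in a single pass.
DELETE_TABLE = str.maketrans("", "", string.punctuation + string.digits)

def basic_clean(text):
    """Lowercase, drop punctuation and digits, then collapse whitespace (hypothetical helper)."""
    text = text.lower().translate(DELETE_TABLE)
    # str.split() with no argument splits on any run of whitespace, including newlines.
    return " ".join(text.split())

cleaned = basic_clean("Hello,   World 123!\n\nBye")
print(cleaned)
```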

In [None]:
print(sentence)

from   ted kalivoda
subject re atheists views on christianity was re accepting jeesus in your heart
organization university of georgia  ucns
lines 32

in article 
  petri pihko wrote
 
 jason smith   wrote 
 
 another answer is that god is the source of all existence
 this sounds much better but i am tempted to ask does god
 himself exist then if god is the source of his own existence
 it can only mean that he has in terms of human time always
 existed but this is not the same as the source of all existence
 this argument sounds like god does not exist but metaexists
 and from his metaexistent perspective he created existence
 i think this is actually a nonsolution a mere twist of words

always existing and being the source of the existence of all other beings
is not problematic

but as you put being the source of all existence including ones own
would mean that god came from nothing a concept alien to christianity and
theism  it is better to understand the classical concepts of necess

## Stop Words

Stop words are a set of common words in the language or the domain you are working on. We consider that stop words do not play an essential role in classifying our documents using keywords and only add unnecessary features to them.

Let's look at some of the basic words:



In [None]:
with open("../assets/stopwords.txt", "r") as f_in:
    stop_words = [i.strip().lower() for i in f_in.readlines()]

In [None]:
print(" ".join([token for token in sentence.split(" ") if token not in stop_words]))

  ted kalivoda
subject atheists views christianity accepting jeesus heart
organization university georgia  ucns
lines 32

in article 
  petri pihko wrote
 
 jason smith   wrote 
 
 another answer god source existence
 sounds much better tempted ask god
 exist god source existence
 mean terms human time always
 existed source existence
 argument sounds like god exist metaexists
 metaexistent perspective created existence
 think actually nonsolution mere twist words

always existing source existence beings
is problematic

but put source existence including ones own
would mean god came nothing concept alien christianity and
theism  better understand classical concepts necessary and
contingent existence  god exists necessarily always  god created
contingent beings  coherent solution existence long as
the concept god coherent
 
 best answer heard human reasoning incapable
 understanding questions atheist not
 accept answers since methods

not good answer  reason cannot means understand some
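
The output above still contains a few stop words such as `is` and `but` at the start of some lines. This appears to happen because splitting on a single space leaves tokens like `"beings\nis"` glued together, so they never match an entry in the list. The sketch below compares that behaviour with a whitespace-aware variant; the small `STOP_WORDS` set and the function name are hypothetical stand-ins for the course stop word file.

```python
# Hypothetical subset of a stop word list; a set makes membership tests fast.
STOP_WORDS = {"is", "the", "of", "but", "as", "in", "and", "not"}

def remove_stop_words(text, stop_words):
    """Filter stop words line by line, splitting on any whitespace."""
    cleaned_lines = []
    for line in text.splitlines():
        kept = [tok for tok in line.split() if tok not in stop_words]
        cleaned_lines.append(" ".join(kept))
    return "\n".join(cleaned_lines)

sample = "always existing\nis not problematic\nbut as you put"

# Splitting on " " only: "existing\nis" and "problematic\nbut" stay as single tokens.
naive = " ".join(tok for tok in sample.split(" ") if tok not in STOP_WORDS)
robust = remove_stop_words(sample, STOP_WORDS)

print(repr(naive))
print(repr(robust))
```

Processing the text line by line keeps the original line structure while still removing the stop words that follow a line break.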

## Normalization

The objective of normalizing a text is to group as many related word forms as possible under the same feature. Normalization is always optional but can be really useful when you work with more traditional machine learning algorithms.

### Removing accents from the text

When dealing with online texts, you will have to remove accents from some letters; these accented characters can result from wrong encoding when scraping the information.

In [None]:
text = unicodedata.normalize('NFKD', 'Sómě Áccěntěd těxt 🙃').encode('ascii', 'ignore').decode('utf-8', 'ignore')
print(text)

Some Accented text 
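
The one-liner above works because NFKD normalization splits an accented letter into its base letter plus a separate combining mark, which the ASCII encoding step then drops along with every other non-ASCII character (including the emoji). The sketch below makes the decomposition visible and shows a variant that removes only the combining marks; `strip_accents` is a hypothetical helper name.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks after NFKD decomposition, keeping other characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# "é" becomes two code points: the base letter and a combining acute accent.
parts = [unicodedata.name(ch) for ch in unicodedata.normalize("NFKD", "é")]
print(parts)

result = strip_accents("Sómě Áccěntěd těxt 🙃")
print(result)
```

Unlike the ASCII round trip, this variant keeps the emoji and any other non-ASCII character that has no decomposition, which can matter if those characters are handled by a separate step.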


After dealing with individual characters, you can look at words. Several techniques can be used to reduce the number of features needed to represent similar words.

### Stemmer

First, let's look at stemming.
Stemming is the process of reducing inflected words to their word stem or root form.

For example:

![play](https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1539984207/stemminglemmatization_n8bmou.jpg)

There are several types of stemming algorithms; let's look at one of them, the Porter stemmer.




In [None]:
stemmer = nltk.stem.PorterStemmer()

words = ["playing", "plays", "played"]
words = [stemmer.stem(word) for word in words]
print(words)

# The stemmer will not affect this word
print(stemmer.stem("is"))

['play', 'play', 'play']
is
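
To make the idea of suffix stripping concrete, here is a deliberately naive stemmer written in plain Python. It is not the Porter algorithm, which applies many more rules and conditions; the suffix list and the minimum stem length of 3 are assumptions chosen only for illustration.

```python
# Suffixes are tried in order, so longer ones come first.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["playing", "plays", "played", "is", "studies"]]
print(stems)
```

The last word shows a typical property of stemming: the stem it produces is not necessarily a word of the language.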


### Lemmatization


Lemmatization is similar to stemming in that it reduces inflected words to a base form. It differs in that it makes sure the resulting root word (the lemma) belongs to the language, and it takes the type of word into account (verb, adjective, noun...). To improve the results, you might consider using a part-of-speech (POS) tagging model to supply that word type.

Because of this extra work, the lemmatizer will be slower than regular stemming.

In [None]:
nltk.download('wordnet')

lemmatizer = nltk.stem.WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [None]:
print(lemmatizer.lemmatize('is', 'v'))

be
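
The role of the word type can be illustrated with a toy lookup keyed by the pair (word, part of speech). This is only a sketch of the idea, not how WordNet or spaCy work internally; the table entries and the function name are hypothetical.

```python
# Hypothetical mini-dictionary: (word, part of speech) -> lemma.
LEMMA_TABLE = {
    ("is", "v"): "be",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the word itself when it is unknown."""
    return LEMMA_TABLE.get((word, pos), word)

as_verb = toy_lemmatize("meeting", "v")
as_noun = toy_lemmatize("meeting", "n")
unknown = toy_lemmatize("kalivoda", "n")
print(as_verb, as_noun, unknown)
```

The same surface form gets a different lemma depending on its tag, which is why supplying a part-of-speech tag to the lemmatizer matters.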


In [None]:
!python -m spacy download en_core_web_sm >> /dev/null



In [None]:
# Using another library: spaCy (the parser and NER components are disabled since we only need lemmas)
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

doc = nlp(sentence)

In [None]:
print(sentence)

from   ted kalivoda
subject re atheists views on christianity was re accepting jeesus in your heart
organization university of georgia  ucns
lines 32

in article 
  petri pihko wrote
 
 jason smith   wrote 
 
 another answer is that god is the source of all existence
 this sounds much better but i am tempted to ask does god
 himself exist then if god is the source of his own existence
 it can only mean that he has in terms of human time always
 existed but this is not the same as the source of all existence
 this argument sounds like god does not exist but metaexists
 and from his metaexistent perspective he created existence
 i think this is actually a nonsolution a mere twist of words

always existing and being the source of the existence of all other beings
is not problematic

but as you put being the source of all existence including ones own
would mean that god came from nothing a concept alien to christianity and
theism  it is better to understand the classical concepts of necess

In [None]:
print(" ".join([token.lemma_ for token in doc]))

from    ted kalivoda 
 subject re atheist view on christianity be re accept jeesus in your heart 
 organization university of georgia   ucns 
 line 32 

 in article 
   petri pihko write 
 
  jason smith    write 
 
  another answer be that god be the source of all existence 
  this sound much well but I be tempt to ask do god 
  himself exist then if god be the source of his own existence 
  it can only mean that he have in term of human time always 
  exist but this be not the same as the source of all existence 
  this argument sound like god do not exist but metaexist 
  and from his metaexistent perspective he create existence 
  I think this be actually a nonsolution a mere twist of word 

 always exist and be the source of the existence of all other being 
 be not problematic 

 but as you put be the source of all existence include one own 
 would mean that god come from nothing a concept alien to christianity and 
 theism   it be well to understand the classical concept of nece

## For next week

- Form groups of 2
- Ask for your dataset
- Complete the data cleaning and processing tasks in the project
