### Welcome to the second module - Intro to NLP & Text pre-processing basics

## Intro
When building machine learning models, data cleaning and preprocessing are crucial steps for creating a reliable dataset, because they improve the quality of your training data.
#### Text data must be cleaned and encoded as numerical values before it is passed to a machine learning model. This process of cleaning and encoding is called text preprocessing.

Today we are going to learn some basic text cleaning steps:
* Understanding the data - See what the data is about and what should be considered when cleaning it (punctuation, stopwords, whitespace, etc.).
* Basic cleaning - The operations applied to text data, such as lowercasing, removal of punctuation, and removal of stopwords.

#### Importing libraries
* pandas for data manipulation and analysis
* re for regular expression (RE) functions, which check whether a string matches a given pattern and can replace the matching parts
* string for ready-made constants such as `string.punctuation`

In [9]:
import pandas as pd  
import re
import string 

In [None]:
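# Load the dataset. "?raw=true" makes GitHub serve the CSV file itself
# rather than the HTML page that the /blob/ URL points to.
dataset_url = 'https://github.com/MBAZA-NLP/nlp-training/blob/main/data/kinnews_raw_500.csv?raw=true'
data = pd.read_csv(dataset_url)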
data

#### Lower casing
* This is one of the most basic pre-processing steps. It converts all strings to the same case so that tokens such as Lower, lower, and LOWER are treated as the same word. This avoids counting one word several times as if it were several unique words.

In [7]:
data['content_lower'] = data['content'].str.lower()
data.head()

'we are headed to the second meetup for the week'
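
To make the effect on vocabulary size concrete, here is a minimal sketch on a handful of made-up tokens (they are an assumption for illustration and do not come from the dataset). It counts distinct spellings before and after lowercasing.

```python
from collections import Counter

# Toy tokens, invented for illustration only
tokens = ["Lower", "lower", "LOWER", "umva", "Umva"]

# Count each distinct spelling as it appears
counts_raw = Counter(tokens)

# Count again after normalizing the case
counts_lower = Counter(token.lower() for token in tokens)

print(len(counts_raw))    # 5 distinct spellings
print(len(counts_lower))  # 2 distinct words
print(counts_lower)
```

Lowercasing is not always harmless: it also merges proper nouns with ordinary words, so whether to apply it depends on the task.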

#### Removal of punctuations
This is again a text standardization process that will help to treat 'umva' and 'umva!' in the same way.

We also need to choose the list of punctuation marks to remove carefully, depending on the use case. For example, `string.punctuation` in Python contains the following symbols:

* `` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ``

Alternatively, we can use a regex (regular expression) together with `re.sub`:
* `[^a-zA-Z0-9\s]`
* `re.sub(pattern, replacement, text)`

The pattern matches every character that is not a digit, a letter in the range 'a to z' or 'A to Z', or whitespace (`\s`). Passing an empty string as the replacement to `re.sub` deletes every match.
The special character ^, placed first inside the square brackets, negates the character class.
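
As a quick check of how the negated character class behaves, the sketch below applies the pattern to a made-up sample string (the sample is an assumption, not taken from the dataset).

```python
import re

# Negated character class: match anything that is NOT a letter, digit or whitespace
pattern = r"[^a-zA-Z0-9\s]"

# Made-up sample string for illustration
text = "umva! umva, 123 #nlp"

# Replace every match with the empty string, i.e. delete it
cleaned = re.sub(pattern, "", text)
print(cleaned)  # umva umva 123 nlp
```

Note that the class only lists unaccented ASCII letters, so accented characters such as `é` would be deleted as well.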


In [13]:
# With the string.punctuation list
# str.maketrans() takes three arguments: the first two are empty strings (no character is mapped to another)
# and the third is the string of punctuation characters to delete

punct_set = string.punctuation 
def remove_punct(text):
    """custom function to remove all punctuations"""
    return text.translate(str.maketrans('', '', punct_set))

# data["content_no_punct"] = data["content"].apply(remove_punct)
# data.head()
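
The following sketch shows the translation table on its own, and how the punctuation list can be adapted to the use case. Keeping the apostrophe is only an assumption made for illustration (for example, to preserve word-internal apostrophes); `custom_punct` and the sample strings are hypothetical.

```python
import string

# Translation table that deletes every character in string.punctuation
table_all = str.maketrans("", "", string.punctuation)
print("umva!".translate(table_all))  # umva

# Hypothetical variant: keep the apostrophe, delete all other punctuation
custom_punct = string.punctuation.replace("'", "")
table_custom = str.maketrans("", "", custom_punct)

sample = "rw'u Rwanda!"  # made-up sample string
print(sample.translate(table_all))     # rwu Rwanda
print(sample.translate(table_custom))  # rw'u Rwanda
```

Building the table once and reusing it avoids recreating it for every row when the function is applied to a whole column.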

In [8]:
# With the re.sub function

def remove_special_chars(text, remove_digits=True):
    """Remove every character except letters and whitespace; digits are kept only if remove_digits is False."""
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    return re.sub(pattern, '', text)

In [None]:
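# Single string: keep digits, drop every other special character
string_noSymbols = remove_special_chars(data['content'].iloc[0], remove_digits=False)
# Whole column: work on a copy so the raw data stays available
data_noSymbols = data.copy()
data_noSymbols['content'] = data['content'].apply(remove_special_chars)
data_noSymbols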
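
Deleting characters often leaves doubled or trailing spaces behind. As a sketch of one possible follow-up, the cell below uses the vectorized pandas string methods on a toy DataFrame (the rows and the `content_clean` column name are assumptions for illustration) to remove symbols and digits and then normalize whitespace.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"content": ["umva 123 umva!", "Noheli, 2021?"]})

toy["content_clean"] = (
    toy["content"]
    .str.replace(r"[^a-zA-Z\s]", "", regex=True)  # drop symbols and digits
    .str.replace(r"\s+", " ", regex=True)         # collapse runs of whitespace
    .str.strip()                                  # trim leading/trailing spaces
)

print(toy["content_clean"].tolist())  # ['umva umva', 'Noheli']
```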

#### Removal of stopwords
Stopwords are very common words that carry little meaning on their own; removing them does not change the sentiment of the sentence.
> The stopwords used here are from the KKLTK Python package for Kinyarwanda and Kirundi languages processing. KKLTK currently provides the sets of stopwords for both languages.  
###### Example 
> Before: Umwihariko wa buri karere ku matungo yabazwe kuri Noheli
>
> After: Umwihariko buri karere matungo yabazwe Noheli (the removed stopwords are "wa", "ku", "kuri")

In [14]:
# Create the set of stopwords in Kinyarwanda
STOPWORD_KN = {'aba', 'abo', 'aha', 'aho', 'ari', 'ati', 'aya', 'ayo', 'ba', 'baba', 'babo', 'bari', 'be', 'bo', 'bose',
           'bw', 'bwa', 'bwo', 'by', 'bya', 'byo', 'cy', 'cya', 'cyo', 'hafi', 'ibi', 'ibyo', 'icyo', 'iki',
           'imwe', 'iri', 'iyi', 'iyo', 'izi', 'izo', 'ka', 'ko', 'ku', 'kuri', 'kuva', 'kwa', 'maze', 'mu', 'muri',
           'na', 'naho','nawe', 'ngo', 'ni', 'niba', 'nk', 'nka', 'no', 'nta', 'nuko', 'rero', 'rw', 'rwa', 'rwo', 'ry',
           'rya','ubu', 'ubwo', 'uko', 'undi', 'uri', 'uwo', 'uyu', 'wa', 'wari', 'we', 'wo', 'ya', 'yabo', 'yari', 'ye',
           'yo', 'yose', 'za', 'zo'}

In [15]:
def remove_stopwords(text):
    """custom function to remove the stopwords"""
    return " ".join([word for word in str(text).split() if word not in STOPWORD_KN])

# data["content"] = data["content"].apply(remove_stopwords)
# data.head()
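
The stopword set contains only lowercase entries, so a capitalized stopword such as "Ku" at the start of a sentence would not be removed by a case-sensitive comparison. The sketch below reproduces the example sentence with a small subset of the stopwords and a hypothetical helper, `drop_stopwords`, that compares the lowercased word instead. Both names are assumptions introduced for illustration.

```python
# Small subset of the stopword set, repeated so this cell runs on its own
stopwords_subset = {"wa", "ku", "kuri"}

def drop_stopwords(text, stopwords):
    """Hypothetical helper: remove stopwords, ignoring the case of each word."""
    return " ".join(word for word in text.split() if word.lower() not in stopwords)

sentence = "Umwihariko wa buri karere ku matungo yabazwe kuri Noheli"
print(drop_stopwords(sentence, stopwords_subset))
# Umwihariko buri karere matungo yabazwe Noheli

# A capitalized stopword is removed too
print(drop_stopwords("Ku matungo", stopwords_subset))  # matungo
```

Lowercasing the text before removing stopwords, as in the lower casing step, achieves the same result.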

### Parallel datasets 
#### `UPCOMING ZINDI HACKATHON ON TEXT PRE-PROCESSING FOR PARALLEL DATA`
> #### `PRACTICE EXERCISE`

In [None]:
#Load the datasets
Raw_kn = pd.read_csv("....") 
Raw_en = pd.read_csv("....")

In [None]:
#Clean the datasets, follow steps above

In [None]:
#merge datasets
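
One possible way to approach the merge step is sketched below on toy data. Everything here is an assumption made for illustration: the column names `id`, `text_kn` and `text_en`, the sample rows, and the idea that both files share a key column. The actual exercise files may be organized differently.

```python
import pandas as pd

# Toy stand-ins for the two raw datasets (invented rows and column names)
toy_kn = pd.DataFrame({"id": [1, 2], "text_kn": ["Umva!", "Amakuru"]})
toy_en = pd.DataFrame({"id": [1, 2], "text_en": ["Listen!", "News"]})

# Apply the same basic cleaning to both sides before merging
toy_kn["text_kn"] = toy_kn["text_kn"].str.lower().str.replace(r"[^a-z\s]", "", regex=True)
toy_en["text_en"] = toy_en["text_en"].str.lower().str.replace(r"[^a-z\s]", "", regex=True)

# Join on the shared key; validate raises an error if a key is duplicated on either side
parallel = toy_kn.merge(toy_en, on="id", how="inner", validate="one_to_one")
print(parallel)
```

If the two files have no shared key and are simply aligned row by row, `pd.concat([toy_kn, toy_en], axis=1)` would be the alternative, at the cost of silently mispairing sentences if one file has missing rows.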