# **Text Preprocessing using SaraikiNLP**

## **Table of Contents**
1. Installation
2. Importing Preprocessing Module
3. Preprocessing Functions
   - Remove Links
   - Remove Hashtags
   - Remove Usernames
   - Remove Phone Numbers
   - Remove Numbers
   - Remove Punctuation
   - Remove White Space
   - Remove Multiple Spaces
   - Remove Line Breaks
   - Separate Numbers From Text
   - Remove Emojis and Symbols
   - Convert Text to Lowercase
   - Remove English
   - Retain Clean Saraiki

## **Installation**
<p align="left">
  <a href="https://pypi.org/project/SaraikiNLP/"><img src="https://img.shields.io/pypi/v/SaraikiNLP.svg"/></a>
  <a href="https://huggingface.co/SaraikiNLP"><img src="https://img.shields.io/badge/Hugging%20Face-SaraikiNLP-yellow?logo=huggingface"/></a>
<a href="https://github.com/SaraikiNLP/SaraikiNLP"><img src="https://img.shields.io/badge/Github-SaraikiNLP-blue?logo=github"/></a>

</p>

In [1]:
!pip install SaraikiNLP

Collecting SaraikiNLP
  Downloading saraikinlp-0.1.0rc2-py3-none-any.whl.metadata (8.2 kB)
Downloading saraikinlp-0.1.0rc2-py3-none-any.whl (9.6 kB)
Installing collected packages: SaraikiNLP
Successfully installed SaraikiNLP-0.1.0rc2


## **Importing Preprocessing Module**

In [2]:
from SaraikiNLP import preprocessing

## **Preprocessing Functions**

### **Remove Links**
Removes various kinds of links, such as `www.` and `https://` URLs.

In [3]:
print(preprocessing.remove_links(
    """میڈا ٹک ٹاک فالو کریسو تے یوٹوب وی وزٹ کریسو
       www.tiktok.com/@example

       https://youtu.be/@example"""
))

میڈا ٹک ٹاک فالو کریسو تے یوٹوب وی وزٹ کریسو
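
To illustrate the idea, here is a minimal regex sketch of link removal. It is not SaraikiNLP's implementation: the pattern and the name `sketch_remove_links` are assumptions made for this example, and the library may well cover more link formats.

```python
import re

# Assumed pattern: anything that starts with http://, https:// or www.
# and runs up to the next whitespace character.
LINK_RE = re.compile(r"(?:https?://|www\.)\S+")

def sketch_remove_links(text: str) -> str:
    """Delete URL-like tokens and trim the ends (hypothetical helper)."""
    return LINK_RE.sub("", text).strip()

print(sketch_remove_links("follow www.tiktok.com/@example and https://youtu.be/@example"))
```

The sketch leaves the spaces around each removed link in place, so a whitespace clean-up step is usually still needed afterwards.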


### **Remove Hashtags**
Removes hashtags in both Saraiki and English, including those that contain underscores.

In [4]:
print(preprocessing.remove_hashtags(
    """#twitter
       #سرائیکی_صوبہ سرائیکستان دی تحریک زور پکڑی ودی ہے


       #sariki_suba_
    """
))

سرائیکستان دی تحریک زور پکڑی ودی ہے
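
A rough sketch of how hashtag removal could work, assuming a hashtag is `#` followed by word characters. The helper name `sketch_remove_hashtags` is hypothetical and the library's actual pattern may differ. In Python's `re`, `\w` matches Unicode letters, digits and the underscore, which is why one pattern can cover both scripts.

```python
import re

# Assumed definition of a hashtag: '#' followed by one or more word characters.
HASHTAG_RE = re.compile(r"#\w+")

def sketch_remove_hashtags(text: str) -> str:
    """Delete hashtag tokens and trim the ends (hypothetical helper)."""
    return HASHTAG_RE.sub("", text).strip()

print(sketch_remove_hashtags("#twitter hello #sariki_suba_"))
```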


### **Remove Usernames**
Removes username handles in the `@username` format.

In [5]:
print(preprocessing.remove_usernames(
    """
     @ali تُوں چپ کر یار
    """
))

تُوں چپ کر یار
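
The same approach can be sketched for handles. This example is an assumption about the technique, not the library's code: `sketch_remove_usernames` is a hypothetical name, and the lookbehind is one possible way to avoid touching e-mail addresses.

```python
import re

# Assumed pattern: '@' followed by word characters, but only when the '@'
# is not glued to a preceding word character (as in an e-mail address).
USERNAME_RE = re.compile(r"(?<!\w)@\w+")

def sketch_remove_usernames(text: str) -> str:
    """Delete @handle tokens and trim the ends (hypothetical helper)."""
    return USERNAME_RE.sub("", text).strip()

print(sketch_remove_usernames("@ali please reply"))
```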


### **Remove Phone Numbers**
Removes most phone number variants, with or without a country code, spaces or hyphens.

In [6]:
print(preprocessing.remove_phone_numbers(
    """
     فون نمبر لِکھ ڳھن ناں0300 1234567
     شاکر کال نہ چاوے تاں ول اے ݙیکھیں 923001234567
     +92-300-123-456-7یا
     یا +92 3001234567
    """
))

فون نمبر لِکھ ڳھن ناں 
 شاکر کال نہ چاوے تاں ول اے ݙیکھیں 
 یا
 یا
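
One way to match the four formats shown above is sketched below. It assumes Pakistani mobile numbers: a prefix of `0`, `92` or `+92`, followed by ten digits that may be separated by single spaces or hyphens. Both the pattern and the name `sketch_remove_phone_numbers` are illustrative assumptions; the library's own rules are not shown in this notebook.

```python
import re

# Assumed shape: prefix (+92, 92 or 0), then exactly ten digits,
# each optionally preceded by one space or hyphen.
PHONE_RE = re.compile(r"(?:\+?92|0)(?:[ -]?\d){10}")

def sketch_remove_phone_numbers(text: str) -> str:
    """Delete phone-number-like sequences (hypothetical helper)."""
    return PHONE_RE.sub("", text)

for sample in ["0300 1234567", "923001234567", "+92-300-123-456-7", "+92 3001234567"]:
    print(repr(sketch_remove_phone_numbers(sample)))
```

A pattern this loose can also hit the start of longer digit strings, so it is best treated as a starting point.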


### **Remove Numbers**
Removes all digits, whether written in Saraiki, Arabic or English numerals.

In [7]:
print(preprocessing.remove_numbers(
    """
     ٧١٧ ٢٥٣٧٤٧ اے پاسورڈ ہے میڈا اے کم نہ کرے تاں Example@123__456 کر
    """
))

اے پاسورڈ ہے میڈا اے کم نہ کرے تاں Example@ __ کر
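
Digit removal across scripts can be sketched with a single pattern, because `\d` in Python's `re` matches every Unicode decimal digit for `str` input, including Arabic-Indic digits. The helper name `sketch_remove_numbers` is hypothetical, and this sketch deletes digits outright, whereas the output above suggests the library may leave a space in their place.

```python
import re

def sketch_remove_numbers(text: str) -> str:
    """Delete every Unicode decimal digit (hypothetical helper)."""
    return re.sub(r"\d+", "", text)

print(sketch_remove_numbers("Example@123__456"))
```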


### **Remove Punctuation**
Removes punctuation marks from any language, including Urdu punctuation such as `۔`.

In [8]:
print(preprocessing.remove_punctuation(
    """
    BREAKING NEWS:
    ...حکومت دا۔۔۔ اعلان
    123
    """
))

BREAKING NEWS 
 حکومت دا اعلان
 123
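
A language-independent way to do this is to drop every character whose Unicode general category starts with `P` (punctuation). The sketch below assumes that approach; the name `sketch_remove_punctuation` is hypothetical and the library may use an explicit character list instead.

```python
import unicodedata

def sketch_remove_punctuation(text: str) -> str:
    """Drop characters in any Unicode punctuation category (hypothetical helper)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))

# U+06D4 is the Arabic full stop used in Urdu and Saraiki text.
print(sketch_remove_punctuation("NEWS: ...ab\u06d4\u06d4"))
```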


### **Remove White Space**
Collapses runs of two or more spaces, line breaks and tabs into a single space.

In [9]:
print(preprocessing.remove_whitespace(
    """
    BREAKING NEWS:
    ...حکومت دا۔۔۔    اعلان
    123
    """
))

BREAKING NEWS: ...حکومت دا۔۔۔ اعلان 123
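
The effect can be approximated with plain string methods. This is a sketch under the assumption that every run of whitespace becomes one space and the ends are trimmed; `sketch_remove_whitespace` is a hypothetical name.

```python
def sketch_remove_whitespace(text: str) -> str:
    """Collapse any run of spaces, tabs and line breaks into one space (hypothetical helper)."""
    # str.split() with no arguments splits on any whitespace run and drops empty pieces.
    return " ".join(text.split())

print(sketch_remove_whitespace("  a \n\t b   c\n"))
```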


### **Remove Multiple Spaces**
Collapses runs of two or more spaces into a single space while leaving line breaks (`\n`) intact.

In [10]:
print(preprocessing.remove_multiple_spaces(
    """
    BREAKING NEWS:
    ...حکومت دا۔۔۔    اعلان
    123
    """
))

BREAKING NEWS:
 ...حکومت دا۔۔۔ اعلان
 123
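
For contrast with whitespace removal, a sketch that touches only space characters could look like this. The pattern and the name `sketch_remove_multiple_spaces` are assumptions for illustration.

```python
import re

def sketch_remove_multiple_spaces(text: str) -> str:
    """Collapse runs of 2+ space characters; line breaks are not matched (hypothetical helper)."""
    return re.sub(r" {2,}", " ", text)

print(repr(sketch_remove_multiple_spaces("a    b\n    c")))
```

Because the pattern contains a literal space rather than `\s`, the line structure of the text survives.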


### **Remove Line Breaks**
Removes line breaks only; it does not target spaces.

In [11]:
print(preprocessing.remove_linebreaks(
    """
    BREAKING NEWS:
    ...حکومت دا۔۔۔ اعلان
    123
    """
))

BREAKING NEWS: ...حکومت دا۔۔۔ اعلان 123


### **Separate Numbers From Text**
Inserts a space between numbers and words that are attached without one.

In [12]:
print(preprocessing.separate_numbers_from_text(
    """
11لاکھ
اج10بندے
110810.22پوائنٹس
    """
))

11 لاکھ
اج 10 بندے
110810.22 پوائنٹس
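
The boundary between a digit and a letter can be found with zero-width lookarounds, as in the sketch below. It is an assumed approach, not the library's code, and `sketch_separate_numbers_from_text` is a hypothetical name. `[^\W\d_]` is a common idiom for "any Unicode letter".

```python
import re

# Match the empty position between a digit and a letter, in either order.
BOUNDARY_RE = re.compile(r"(?<=\d)(?=[^\W\d_])|(?<=[^\W\d_])(?=\d)")

def sketch_separate_numbers_from_text(text: str) -> str:
    """Insert a space wherever a digit touches a letter (hypothetical helper)."""
    return BOUNDARY_RE.sub(" ", text)

print(sketch_separate_numbers_from_text("ab10cd"))
print(sketch_separate_numbers_from_text("110810.22pts"))
```

The decimal point is neither a digit nor a letter, so a number such as `110810.22` is kept in one piece.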


### **Remove Emojis and Symbols**
Removes all kinds of emojis, emoticons, symbols and similar characters.

In [13]:
print(preprocessing.remove_emojis_and_symbols(
    """
    🥀شاکر توں💡💡وی کملا.. ہیں 🥺🥺 💡

    """
))

شاکر توں وی کملا.. ہیں
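
Emoji removal is usually done with a character class over Unicode ranges. The ranges below are an assumed, non-exhaustive selection chosen for this sketch, and `sketch_remove_emojis` is a hypothetical name; the library's coverage is likely broader.

```python
import re

# Assumed ranges: pictographs and emoticons (U+1F300-U+1FAFF), miscellaneous
# symbols and dingbats (U+2600-U+27BF), regional indicators (flags),
# plus the variation selector and zero-width joiner used in emoji sequences.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\u2600-\u27BF\U0001F1E6-\U0001F1FF\uFE0F\u200D]+"
)

def sketch_remove_emojis(text: str) -> str:
    """Delete characters in the assumed emoji ranges (hypothetical helper)."""
    return EMOJI_RE.sub("", text)

print(sketch_remove_emojis("\U0001F940ab\U0001F4A1\U0001F4A1cd"))
```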


### **Convert Text to Lowercase**
Lowercases English in text.

In [14]:
print(preprocessing.to_lowercase(
    """
    BREAKING NEWS:
    حکومت دا اعلان

    """
))


    breaking news:
    حکومت دا اعلان

    


### **Remove English**
Removes English letters along with English punctuation and numbers.

It then applies **remove_multiple_spaces** to avoid unintended word joining.

In [15]:
print(preprocessing.remove_english(
    """
    BREAKING NEWS:
    حکومت دا   اعلان

    """
))

حکومت دا اعلان
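
The two-step behaviour described above can be sketched as follows, assuming that "English" means printable ASCII characters other than the space. Each removed run is replaced by a space so that neighbouring words do not fuse, and the extra spaces are collapsed afterwards. The name `sketch_remove_english` is hypothetical.

```python
import re

def sketch_remove_english(text: str) -> str:
    """Strip ASCII letters, digits and punctuation, then tidy spaces (hypothetical helper)."""
    # Step 1: replace every run of printable ASCII (U+0021 to U+007E) with one space.
    without_ascii = re.sub(r"[!-~]+", " ", text)
    # Step 2: collapse the resulting runs of spaces and trim the ends.
    return re.sub(r" {2,}", " ", without_ascii).strip()

print(sketch_remove_english("BREAKING NEWS: حکومت دا   اعلان"))
```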


### **Retain Clean Saraiki**
**Normalizes the text, removes everything else (English and other languages, non-Urdu punctuation, etc.) and retains only the following:**

 👉 Basic Arabic block (includes Urdu punctuation)

 👉 Arabic Supplement

 👉 Diacritical marks ( ِ َ ؒ ٗ ّ ْ ٌ ُ ۤ) etc

 👉 Whitespace (\s+)

 👉 Western digits (0-9)

 👉 Arabic-Indic digits (٠-٩)

 👉 Additional Saraiki characters (ݨ ݙ ڳ ڄ ٻ) etc

In [16]:
print(preprocessing.retain_clean_saraiki(
    """
    BREAKING NEWS:
    ...حکومت دا۔۔۔ اعلان
    123
    """
))

حکومت دا۔۔۔ اعلان
 123
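
The retention list above maps naturally onto a whitelist character class. The sketch below assumes the Unicode ranges U+0600 to U+06FF (Arabic block, which contains the diacritics, Urdu punctuation and Arabic-Indic digits) and U+0750 to U+077F (Arabic Supplement), plus whitespace and Western digits. It skips the normalization step, whose details are not described here, and `sketch_retain_clean_saraiki` is a hypothetical name.

```python
import re

# Anything NOT in the whitelist is deleted.
NOT_ALLOWED_RE = re.compile(r"[^\u0600-\u06FF\u0750-\u077F\s0-9]")

def sketch_retain_clean_saraiki(text: str) -> str:
    """Keep Arabic-script characters, whitespace and Western digits (hypothetical helper)."""
    return NOT_ALLOWED_RE.sub("", text)

# U+062D is an Arabic-block letter, U+0768 is a Saraiki letter in Arabic Supplement.
print(sketch_retain_clean_saraiki("A\u062dB 1\u0768!"))
```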
