**Project Details**

| | Details |
|----------|---------|
| Author   | Alfrethanov Christian Wijaya |
| Dataset  | regex-training.tsv |
| Goal     | Data Analysis / Science training with regex (Regular Expression). |

# **REGEX TRAINING**

**Dataset :**

regex-training.tsv (Tab-Separated Values)

<br>

**Goal :**

Practicing regex (Regular Expression).

<br>

**Description :**

This is the result of what I have learned about Regex from DQLab Academy.

# **WHAT IS REGEX?**

**DESCRIPTION**

**Regular Expression**, often referred to as `regex` or `regexp`, is a sequence of characters that defines a pattern for matching text. Regex comes up constantly in data analysis and prediction work, especially in the data preprocessing step, where it is used to match, search, and manipulate string data. Regex is case-sensitive by default, meaning it distinguishes uppercase from lowercase letters. To ignore case in the pandas string methods used in this notebook, set the `case` parameter to False (`case = False`).
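As a minimal sketch of case sensitivity, the snippet below uses Python's built-in `re` module instead of pandas. The sample string is made up for illustration, and `re.IGNORECASE` is assumed here as the plain-Python counterpart of `case = False`.

```python
import re

text = "Ayam Goreng"  # made-up sample string

# Regex is case-sensitive by default, so a lowercase pattern misses "Ayam".
sensitive = re.search("ayam", text)

# re.IGNORECASE plays the same role as case=False in pandas string methods.
insensitive = re.search("ayam", text, flags=re.IGNORECASE)

print(sensitive)            # None
print(insensitive.group())  # Ayam
```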

<br>

--------
<br>
<center>

**NOTATIONS in REGEX**

| Notation | Notation Name | Purpose | Matching Text |
|----------|---------------|--------------------------------------------------------|----------------------------------------------|
| `.` | Wildcard | Matches any single character except newline | Regex (pa.i) will match padi, paTi, pa9i, etc. |
| `?` | Optionality | The character before "?" is optional | Regex (Bid?jak) will match Bidjak or Bijak |
| `+` | Repeatability | Repeats the preceding character one or more times | Regex (go+L) will match goL, goooL, etc. |
| `*` | Repeatability | Repeats the preceding character zero or more times | Regex (go*L) will match gL, goL, gooooL, etc. |
| `[]` | Choice / Set | Limits matching pattern to specified characters | Regex (mak[Ai]n) will match makAn or makin |
| `-` | Range | Specifies a specific range of characters | Regex ([b-d]uka) will match buka, cuka, duka |
| `[^]` | Complement | Negates a set: matches any character that is not listed | Regex ([^aiueo]) will match any character that is not a lowercase vowel, e.g. consonants |
| `^` | Prefix | Matches the beginning of a string | Regex (^ayam) will match ayam goreng, etc. |
| `$` | Suffix | Matches the end of a string | Regex (ayam$) will match kari ayam, etc. |
| `{}` | Quantifier | Specifies required number of characters/groups/classes | Regex (he.{2}o) will match hello, herro, etc. |

</center>
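The notations in the table can be tried out with Python's built-in `re` module. This is only a sketch: the "should not match" strings are made up for illustration, and `re.fullmatch` is used so that the whole string has to fit the pattern.

```python
import re

# (pattern, text that should match, text that should not match)
examples = [
    (r"pa.i",     "pa9i",  "pai"),
    (r"Bid?jak",  "Bijak", "Biddjak"),
    (r"go+L",     "goooL", "gL"),
    (r"go*L",     "gL",    "gaL"),
    (r"mak[Ai]n", "makAn", "maken"),
    (r"[b-d]uka", "cuka",  "luka"),
    (r"he.{2}o",  "herro", "helo"),
]

# fullmatch succeeds only when the entire string fits the pattern.
results = [
    (pattern, bool(re.fullmatch(pattern, good)), bool(re.fullmatch(pattern, bad)))
    for pattern, good, bad in examples
]
for pattern, good_ok, bad_ok in results:
    print(pattern, good_ok, bad_ok)  # each line ends with: True False

# ^ and $ anchor a pattern to the start or the end of the string.
starts = bool(re.search(r"^ayam", "ayam goreng"))
ends = bool(re.search(r"ayam$", "kari ayam"))
not_start = bool(re.search(r"^ayam", "kari ayam"))
print(starts, ends, not_start)  # True True False
```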
<br>

---
<br>
<center>

**SETS and GROUPS**

Regex notations can be combined, and two constructs are especially important when combining them: **sets** and **groups**. A set is a collection of characters enclosed in square brackets `[]`, and it matches any single character from that collection. A group, also known as a capture group, is a sequence of regex notations enclosed in parentheses `()`. Consider the following examples of sets:

<br>

| Notation | Description |
|----------|----------------------------------------------------|
| `[abc]` | The text will match if it contains a, b, or c. |
| `[a-z]` | The text will match if it contains a lowercase letter. |
| `[A-Z]` | The text will match if it contains an uppercase letter. |
| `[a-zA-Z]` | The text will match if it contains a letter, either lowercase or uppercase. |
| `[^a-z]` | The text will match if it contains a character other than a lowercase letter (negation). |
| `[0-9]` | The text will match if it contains a numeric character (digit). |
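To see which characters each set picks out, here is a small sketch with `re.findall` from Python's standard library. The sample string is made up for illustration.

```python
import re

text = "Kota-7 B"  # made-up sample string

lower = re.findall(r"[a-z]", text)
upper = re.findall(r"[A-Z]", text)
letters = re.findall(r"[a-zA-Z]", text)
digits = re.findall(r"[0-9]", text)
not_lower = re.findall(r"[^a-z]", text)

print(lower)      # ['o', 't', 'a']
print(upper)      # ['K', 'B']
print(letters)    # ['K', 'o', 't', 'a', 'B']
print(digits)     # ['7']
print(not_lower)  # ['K', '-', '7', ' ', 'B']
```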

<br>

`\\`

`\\` followed by a group number is a regex notation (a backreference) that returns the text captured by that group. For example:

If there are two groups, as in `([0-9])-([a-b])`, then Group 1 `([0-9])` = `\\1` and Group 2 `([a-b])` = `\\2`.

In Python you can write these as "`'\\1'` and `'\\2'`" or as raw strings, "`r'\1'` and `r'\2'`".

</center>
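A short sketch of group references with `re.sub`, assuming the two groups are meant as the character sets `[0-9]` and `[a-b]`. The input string `7-a` is made up for illustration.

```python
import re

pattern = r"([0-9])-([a-b])"  # Group 1 = one digit, Group 2 = the letter a or b

# Swap the two groups. The replacement can be written as a raw string...
swapped_raw = re.sub(pattern, r"\2-\1", "7-a")
# ...or as a normal string with doubled backslashes. Both mean the same thing.
swapped_escaped = re.sub(pattern, "\\2-\\1", "7-a")

print(swapped_raw)      # a-7
print(swapped_escaped)  # a-7
```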
<br>

---
<br>
<center>

**.contains()**

The `.str.contains()` method in pandas tests whether a word pattern or regex is present in each string of a column. It returns True if there is a match and False if there is not. The parameters that can be used in `.str.contains()` are as follows:

<br>

| Parameter | Definition |
|-----------|---------------------------------------------------------------|
| `pat` | A string of characters or a regular expression pattern (*pat* = *pattern*) |
| `case` | If initialized as True, it considers case sensitivity. If False, it ignores case sensitivity. |
| `flags` | Flags to be passed to the re module (requires the re library, e.g., re.IGNORECASE) |
| `na` | The value to return for missing values (by default they stay missing) |
| `regex` | If initialized as True, pat is treated as a regular expression. If False, pat is treated as a plain string. |

<br>

<small>**Notes**: The only required parameter is `pat`. The rest already have default values, so set them only when you need to.</small>

</center>
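The effect of the optional parameters can be sketched on a small Series. The values below are made up for illustration and include one missing entry.

```python
import pandas as pd

# Made-up sample values, including a missing one
s = pd.Series(["Jakarta", "jakarta", "Bandung", None])

# Default: case-sensitive, and the missing value stays missing.
default = s.str.contains("jak")
# case=False ignores upper/lower case.
no_case = s.str.contains("jak", case=False)
# na=False reports the missing value as False instead.
filled = s.str.contains("jak", case=False, na=False)
# regex=False treats "." as a plain dot, not as a wildcard.
literal = s.str.contains(".", regex=False)

print(filled.tolist())  # [True, True, False, False]
```

Passing `na=False` is convenient when the result is used as a filter, because a boolean mask cannot contain missing values.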
<br>

---
<br>
<center>

**.replace()**

The `.str.replace()` method in pandas replaces each occurrence of a pattern in a string with another character or string. The parameters that can be used in `.str.replace()` are as follows:

<br>

| Parameter | Definition |
|-----------|--------------------------------------------------------------|
| `pat` | A sequence of characters or a regular expression pattern to be replaced |
| `repl` | The text or character to replace with |
| `n` | The maximum number of replacements to be made from the beginning (default = -1: meaning all occurrences will be replaced) |
| `case` | If initialized as True, it considers case sensitivity. If False, it ignores case sensitivity. |
| `flags` | Flags to be passed to the re module (requires the re library, e.g., re.IGNORECASE) |
| `regex` | If initialized as True, pat is treated as a regular expression. If False, pat is treated as a plain string. The default is False since pandas 2.0, so pass `regex = True` explicitly when pat is a regex. |

<br>

<small>**Notes**: The only required parameters are `pat` and `repl`. The rest already have default values, so set them only when you need to.</small>

</center>
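A minimal sketch of how `regex`, `case`, and `n` change the result of `.str.replace()`. The sample values are made up for illustration.

```python
import pandas as pd

s = pd.Series(["a.b axb"])  # made-up sample value

# regex=False: "a.b" is literal text, so only the first word matches.
literal = s.str.replace("a.b", "X", regex=False)
# regex=True: "." is a wildcard, so "axb" matches as well.
as_regex = s.str.replace("a.b", "X", regex=True)
# case=False ignores upper/lower case.
no_case = s.str.replace("AXB", "X", case=False, regex=True)
# n=1 stops after the first replacement.
first_only = pd.Series(["1-2-3"]).str.replace("-", "/", n=1, regex=False)

print(literal.tolist())     # ['X axb']
print(as_regex.tolist())    # ['X X']
print(no_case.tolist())     # ['a.b X']
print(first_only.tolist())  # ['1/2-3']
```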
<br>

---

---

<center>

# **START**

</center>

---

# **Importing Library**

In [None]:
# Import Library
import pandas as pd

# **Importing Dataset**

`df` = dataset for training

`df2` = dataset for applying all of the regex methods that I've learned in training

In [None]:
# Import Dataset
df = pd.read_csv("regex-training.tsv", sep = '\t')

df2 = df.copy()

In [None]:
df.head()

Unnamed: 0,no_pencatatan,tanggal_catat,kota,jumlah_member,staf_pencatat
0,1,01-05-2020,Jakarta,311,Andra
1,2,30-06-2020,Jakarta,1I2,Andra
2,3,05/02/2020,Bandung,5S0,Antara
3,4,06/28/2020,Bandung,670,Antara
4,5,05/10/2020,Semarang,81O,Senja


In [None]:
df2.head()

Unnamed: 0,no_pencatatan,tanggal_catat,kota,jumlah_member,staf_pencatat
0,1,01-05-2020,Jakarta,311,Andra
1,2,30-06-2020,Jakarta,1I2,Andra
2,3,05/02/2020,Bandung,5S0,Antara
3,4,06/28/2020,Bandung,670,Antara
4,5,05/10/2020,Semarang,81O,Senja
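The `Unnamed: 0` column in the previews suggests that the file stores a row index as its first, unnamed column. As a sketch under that assumption, `index_col=0` tells `pd.read_csv` to use that column as the index. The in-memory text below is a stand-in for `regex-training.tsv`, built from two rows of the preview.

```python
import io
import pandas as pd

# Stand-in for regex-training.tsv: two rows copied from the preview,
# written with tab separators and an unnamed leading index column (assumed layout).
tsv_text = (
    "\tno_pencatatan\ttanggal_catat\tkota\tjumlah_member\tstaf_pencatat\n"
    "0\t1\t01-05-2020\tJakarta\t311\tAndra\n"
    "1\t2\t30-06-2020\tJakarta\t1I2\tAndra\n"
)

# Without index_col, the unnamed column is loaded as "Unnamed: 0".
raw = pd.read_csv(io.StringIO(tsv_text), sep="\t")
# With index_col=0, it becomes the row index instead.
indexed = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col=0)

print(raw.columns[0])         # Unnamed: 0
print(list(indexed.columns))
```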


# **Creating New Column using Regex**

## **.contains()**

The `.str.contains()` method in pandas tests whether a word pattern or regex is present in each string of a column. It returns True if there is a match and False if there is not.

### **Every city name that starts with j, J, s, or S must be True**

Using : `^`, `|`, `()`, and `case = False`

For Column : **kota**

Goal :

City names that start with j, J, s, or S must be True
1. Jakarta = True
2. jakarta = True
3. Semarang = True
4. semarang = True

In [None]:
# Creating new column named 'kota_awalan_j_s'
df['kota_awalan_j_s'] = df['kota'].str.contains('^(j|s)', case = False)

print(df[['kota','kota_awalan_j_s']])

       kota  kota_awalan_j_s
0   Jakarta             True
1   Jakarta             True
2   Bandung            False
3   Bandung            False
4  Semarang             True
5  Semarang             True
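Because the pattern `'^(j|s)'` contains a capturing group, pandas emits a warning about match groups when it is passed to `.str.contains()`. A possible alternative, sketched here on made-up sample values, is a non-capturing group `(?:...)` or a set `[js]`. Both give the same True/False result.

```python
import pandas as pd

# Made-up sample values; "Bekasi" contains an s but does not start with one.
kota = pd.Series(["Jakarta", "jakarta", "Bandung", "Semarang", "Bekasi"])

# (?:...) groups without capturing, so there are no match groups to warn about.
starts_js = kota.str.contains(r"^(?:j|s)", case=False)
# A set expresses the same choice between single characters.
starts_js_set = kota.str.contains(r"^[js]", case=False)

print(starts_js.tolist())      # [True, True, False, True, False]
print(starts_js_set.tolist())  # [True, True, False, True, False]
```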




### **Every staff name that is "Senja", or a variant of it with one extra character, must be True**

Using : `.` and `?`

For Column : **staf_pencatat**

Goal :
1. Senja = True
2. Sendja = True
3. Sen_ja = True

In [None]:
# Creating new column named 'pencatat_senja'
df['pencatat_senja'] = df['staf_pencatat'].str.contains('Sen.?ja')

print(df[['staf_pencatat','pencatat_senja']])

  staf_pencatat  pencatat_senja
0         Andra           False
1         Andra           False
2        Antara           False
3        Antara           False
4         Senja            True
5        Sendja            True
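To show where the pattern `Sen.?ja` stops matching, here is a small sketch. The variants below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up variants: two extra characters, or a lowercase s, should not match.
names = pd.Series(["Senja", "Sendja", "Sen_ja", "Senddja", "senja"])

# ".?" allows at most one arbitrary character between "Sen" and "ja".
is_senja = names.str.contains("Sen.?ja")

print(is_senja.tolist())  # [True, True, True, False, False]
```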


### **Are there any non-numerical values in a numerical column?**

Using : `[]` and `^`

For Column : **jumlah_member**

Goal :

**jumlah_member** should be numerical data (int), so every value in the **jumlah_member** column that contains a non-numerical character must be flagged.
1. 311 = False
2. 1I2 = True
3. 5S0 = True
4. 670 = False
5. 81O = True
6. 1O2 = True

In [None]:
# Creating new column named 'char_nonnumerik'
df['char_nonnumerik'] = df['jumlah_member'].str.contains('[^0-9]')

print(df[['jumlah_member','char_nonnumerik']])

  jumlah_member  char_nonnumerik
0           311            False
1           1I2             True
2           5S0             True
3           670            False
4           81O             True
5           1O2             True
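The same check can also be phrased the other way round: a value is suspicious when it does not consist entirely of digits. The sketch below uses `.str.fullmatch()` for that and rebuilds the column values shown above as a standalone Series.

```python
import pandas as pd

# Values of jumlah_member as shown in the output above
jumlah = pd.Series(["311", "1I2", "5S0", "670", "81O", "1O2"])

# Original check: the value contains at least one non-digit character.
has_non_digit = jumlah.str.contains("[^0-9]")
# Alternative check: the value is NOT made up entirely of digits.
not_all_digits = ~jumlah.str.fullmatch("[0-9]+")

print(has_non_digit.tolist())   # [False, True, True, False, True, True]
print(not_all_digits.tolist())  # [False, True, True, False, True, True]
```

The two checks differ only for empty strings, which the second one flags and the first one does not.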


## **.replace()**

The `.str.replace()` method in pandas replaces each occurrence of a pattern in a string with another character or string.

### **Replace Words to a Specific Word**

Using : `.` and `?`

For Column : **staf_pencatat**

Goal :
1. Sendja = Senja
2. Sen_ja = Senja
3. Etc.

In [None]:
# Replacing every variant of "Senja" (e.g. "Sendja") with "Senja"
df['staf_pencatat'] = df['staf_pencatat'].str.replace('Sen.?ja', 'Senja', regex = True)

print(df['staf_pencatat'])

0     Andra
1     Andra
2    Antara
3    Antara
4     Senja
5     Senja
Name: staf_pencatat, dtype: object




### **Replace Non-Numerical to Numerical**

Using : `[]`, `for loop`, and `mapping`

For Column : **jumlah_member**

Goal :

**jumlah_member** should be numerical data (int), so every non-numerical character in the **jumlah_member** column must be replaced with a numerical one. Here are the conditions :
1. O = 0
2. I = 1
3. S = 5

In [None]:
# Replacing the non-numerical value in 'jumlah_member' to numerical value
mapchange = {'O':'0','I':'1','S':'5'}
df['jumlah_member_clean'] = df['jumlah_member']

for changefrom, changeto in mapchange.items():
	df['jumlah_member_clean'] = df['jumlah_member_clean'].str.replace(changefrom, changeto, case = False)

print(df[['jumlah_member', 'jumlah_member_clean']])

  jumlah_member jumlah_member_clean
0           311                 311
1           1I2                 112
2           5S0                 550
3           670                 670
4           81O                 810
5           1O2                 102
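Since every entry in the mapping is a single character, the loop could also be replaced by one call to `.str.translate()`. This is a sketch of that alternative, not the method used in the training. Unlike the loop, which passes `case = False`, it is case-sensitive.

```python
import pandas as pd

# Values of jumlah_member as shown in the output above
jumlah = pd.Series(["311", "1I2", "5S0", "670", "81O", "1O2"])

# str.maketrans builds a character-to-character table from the same mapping.
table = str.maketrans({"O": "0", "I": "1", "S": "5"})
cleaned = jumlah.str.translate(table)

print(cleaned.tolist())  # ['311', '112', '550', '670', '810', '102']
```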


### **Deleting Non-Numerical**

Using : `[]` and `^`

For Column : **jumlah_member**

Goal :
1. 311 = 311
2. 1I2 = 12
3. 5S0 = 50
4. 670 = 670
5. 81O = 81
6. 1O2 = 12

In [None]:
# Deleting the non-numerical characters in 'jumlah_member'
df['jumlah_member'] = df['jumlah_member'].str.replace('[^0-9]', '', regex = True)

print(df['jumlah_member'])

0    311
1     12
2     50
3    670
4     81
5     12
Name: jumlah_member, dtype: object
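The output above still has dtype `object`, so the values are strings. A minimal sketch of the follow-up step, using the shorthand `\D` (any non-digit) as an assumed equivalent of `[^0-9]` for this data, and then casting to integers:

```python
import pandas as pd

# Values of jumlah_member before cleaning
jumlah = pd.Series(["311", "1I2", "5S0", "670", "81O", "1O2"])

# \D matches any character that is not a digit.
digits_only = jumlah.str.replace(r"\D", "", regex=True)
# The cleaned strings can now be converted to integers.
as_int = digits_only.astype(int)

print(as_int.tolist())  # [311, 12, 50, 670, 81, 12]
```

Deleting characters changes the values (`1I2` becomes 12, not 112), so the mapping approach from the previous step preserves more information when the letters are likely typos for digits.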




### **Fixing the Date Format**

Using : `\\`, `[]`, and `{number}`

For Column : **tanggal_catat**

Goal :

DD-MM-YYYY = MM/DD/YYYY

In [None]:
# Fixing the date format: DD-MM-YYYY becomes MM/DD/YYYY
df['tanggal_catat'] = df['tanggal_catat'].str.replace('([0-9]{2})-([0-9]{2})-([0-9]{4})', '\\2/\\1/\\3', regex = True)

print(df['tanggal_catat'])

0    05/01/2020
1    06/30/2020
2    05/02/2020
3    06/28/2020
4    05/10/2020
5    06/28/2020
Name: tanggal_catat, dtype: object
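One way to confirm that every date now follows MM/DD/YYYY is to parse the column with an explicit format, which raises an error on any value that does not fit. This is a sketch on three dates taken from the preview, not part of the original training.

```python
import pandas as pd

# Three dates from the preview: two DD-MM-YYYY and one already MM/DD/YYYY
tanggal = pd.Series(["01-05-2020", "30-06-2020", "05/02/2020"])

# Groups 1, 2, 3 are day, month, year. The replacement reorders them.
fixed = tanggal.str.replace(
    r"([0-9]{2})-([0-9]{2})-([0-9]{4})", r"\2/\1/\3", regex=True
)
# Parsing with an explicit format fails loudly if any value is still wrong.
parsed = pd.to_datetime(fixed, format="%m/%d/%Y")

print(fixed.tolist())                           # ['05/01/2020', '06/30/2020', '05/02/2020']
print(parsed.dt.strftime("%Y-%m-%d").tolist())  # ['2020-05-01', '2020-06-30', '2020-05-02']
```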




# **Implementing All of the Regex Methods that I've Learned**

In [None]:
print("Original Table:")
print(df2)

# Fixing the date format: DD-MM-YYYY and MM/DD/YYYY both become YYYY-MM-DD
mapchange = {'([0-9]{2})-([0-9]{2})-([0-9]{4})': '\\3-\\2-\\1', '([0-9]{2})/([0-9]{2})/([0-9]{4})' : '\\3-\\1-\\2'}
for changefrom, changeto in mapchange.items():
    df2['tanggal_catat'] = df2['tanggal_catat'].str.replace(changefrom, changeto, regex = True)

# Changing the datatype of the 'tanggal_catat' column to datetime
df2['tanggal_catat'] = pd.to_datetime(df2['tanggal_catat'])

# Deleting non-numerical characters in the 'jumlah_member' column and changing the datatype to integer
df2['jumlah_member'] = df2['jumlah_member'].str.replace('[^0-9]', '', regex = True)
df2['jumlah_member'] = df2['jumlah_member'].astype(int)

# Replacing every variant of "Senja" (e.g. "Sendja") with just "Senja"
df2['staf_pencatat'] = df2['staf_pencatat'].str.replace('Sen.?ja', 'Senja', regex = True)

# Result
print("\nFinal Table:")
print(df2) 

Original Table:
   no_pencatatan tanggal_catat      kota jumlah_member staf_pencatat
0              1    01-05-2020   Jakarta           311         Andra
1              2    30-06-2020   Jakarta           1I2         Andra
2              3    05/02/2020   Bandung           5S0        Antara
3              4    06/28/2020   Bandung           670        Antara
4              5    05/10/2020  Semarang           81O         Senja
5              6    06/28/2020  Semarang           1O2        Sendja

Final Table:
   no_pencatatan tanggal_catat      kota  jumlah_member staf_pencatat
0              1    2020-05-01   Jakarta            311         Andra
1              2    2020-06-30   Jakarta             12         Andra
2              3    2020-05-02   Bandung             50        Antara
3              4    2020-06-28   Bandung            670        Antara
4              5    2020-05-10  Semarang             81         Senja
5              6    2020-06-28  Semarang             12         Senja
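The cleaning steps can also be bundled into a function so they can be rerun on new data and checked on a small sample. `clean_records` is a hypothetical helper name, and the sample rows are taken from the original table. This is a sketch, not the approach used in the training.

```python
import pandas as pd

def clean_records(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper bundling the three cleaning steps."""
    out = df.copy()
    # DD-MM-YYYY and MM/DD/YYYY both become YYYY-MM-DD.
    out["tanggal_catat"] = (
        out["tanggal_catat"]
        .str.replace(r"([0-9]{2})-([0-9]{2})-([0-9]{4})", r"\3-\2-\1", regex=True)
        .str.replace(r"([0-9]{2})/([0-9]{2})/([0-9]{4})", r"\3-\1-\2", regex=True)
    )
    out["tanggal_catat"] = pd.to_datetime(out["tanggal_catat"], format="%Y-%m-%d")
    # Delete non-digit characters, then convert to integers.
    out["jumlah_member"] = (
        out["jumlah_member"].str.replace("[^0-9]", "", regex=True).astype(int)
    )
    # Normalize variants such as "Sendja" to "Senja".
    out["staf_pencatat"] = out["staf_pencatat"].str.replace("Sen.?ja", "Senja", regex=True)
    return out

# Two rows from the original table
sample = pd.DataFrame({
    "tanggal_catat": ["30-06-2020", "06/28/2020"],
    "jumlah_member": ["1I2", "1O2"],
    "staf_pencatat": ["Andra", "Sendja"],
})
result = clean_records(sample)
print(result)
```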



---

<center>

# **END**

</center>

---

# **Training Source :**

DQLab