# Reflective Writing for Data Science Career Path - Codecademy
by [Charalampos Spanias](https://cspanias.github.io/aboutme/) - February 2021

## Content
1. Getting Started with Data Science
2. Python Fundamentals
3. Data Acquisition
4. Data Manipulation with Pandas
5. [Data Wrangling \& Tidying](#wrangling)
    1. [Fundamentals](#fun)
    1. [Regular Expressions](#regexes)
        1. [Intro](#regexintro)
        1. [Literals](#literals)
        1. [Alternation](#alternation)
        1. [Character Sets](#charsets)
        1. [Wild for Wildcards](#wildcards)
        1. [Ranges](#ranges)
        1. [Shorthand Character Classes](#classes)
        1. [Grouping](#grouping)
        1. [Quantifiers - Fixed](#fixedquant)
        1. [Quantifiers - Optional](#optionalquant)
        1. [Quantifiers - 0 or More, 1 or More](#starplus)
        1. [Anchors](#anchors)

<a name="wrangling"></a>
# 5. Data Wrangling & Tidying

<a name="fun"></a>
# 5.1 Fundamentals

Frequently when we work with data, we encounter __unstructured and/or messy data__. While the data may be messy, it is still extremely informative. 

We need to __clean, transform, and sometimes manipulate the data__ structure to gain any insights. This process is often called __data wrangling__ or __data munging__.

At the final stages of the data wrangling process, we will have __a dataset that we can easily use for modeling purposes or for visualization purposes__. This is a __tidy dataset__ where each column is a variable and each row is an observation.

In [58]:
import pandas as pd

# import data
df = pd.read_csv('mma_data.csv')

# check last 5 rows
df.tail()

Unnamed: 0,click_page_details_name,click_page_details_record,click_page_details_current_streak,click_page_details_age,click_page_details_dob,click_page_details_class,click_page_details_height,click_page_details_reach,click_page_details_last_fight,click_page_details_gym,click_page_details_gym_url,click_page_details_opponent_name,click_page_details_opponent_result,click_page_details_opponent_result_url,click_page_details_opponent_fight_date
170,Giannis Stavridis,0-2-0 (Win-Loss-Draw),2 Losses,,,Welterweight,,,"October 16, 2021",,,Pepi Ivanov,Loss · Decision,https://www.tapology.com/fightcenter/bouts/371...,2018.03.31
171,Yoan Petrov,0-2-0 (Win-Loss-Draw),2 Losses,,,Welterweight,,,"October 24, 2020",,,Labis Flindris,Cancelled Bout,https://www.tapology.com/fightcenter/bouts/518...,2020.10.24
172,Yoan Petrov,0-2-0 (Win-Loss-Draw),2 Losses,,,Welterweight,,,"October 24, 2020",,,Sarantis Nikoglou,Cancelled Bout,https://www.tapology.com/fightcenter/bouts/525...,2020.10.24
173,Yoan Petrov,0-2-0 (Win-Loss-Draw),2 Losses,,,Welterweight,,,"October 24, 2020",,,Georgios Kougioumtzidis,Loss · Decision · Unanimous,https://www.tapology.com/fightcenter/bouts/534...,2020.10.24
174,Yoan Petrov,0-2-0 (Win-Loss-Draw),2 Losses,,,Welterweight,,,"October 24, 2020",,,Emil Tepavicharov,Loss · Triangle Choke · 3:12 · R1,https://www.tapology.com/fightcenter/bouts/479...,2019.11.16


In [59]:
# check shape
df.shape

(175, 15)

In [60]:
# check NaNs
df.isna().sum()

click_page_details_name                     0
click_page_details_record                   0
click_page_details_current_streak           0
click_page_details_age                     79
click_page_details_dob                     79
click_page_details_class                    0
click_page_details_height                  76
click_page_details_reach                  165
click_page_details_last_fight               0
click_page_details_gym                     69
click_page_details_gym_url                 69
click_page_details_opponent_name            0
click_page_details_opponent_result         30
click_page_details_opponent_result_url     30
click_page_details_opponent_fight_date      0
dtype: int64

In [61]:
# check duplicates
df.duplicated().sum()

# if duplicates
# df.drop_duplicates(inplace=True)

0

To keep the column names consistent, we will iterate over the column names of our dataset and convert them all to lowercase using the built-in `map()` function together with the string method `str.lower`. 

We pass `str.lower` itself (without parentheses) so that `map()` can call it on each column name, which is a string.
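
As a minimal sketch of how `map()` and `str.lower` work together, here is the same idea on a plain Python list. The column names are hypothetical and only serve as an illustration:

```python
# Hypothetical column names, only for illustration
columns = ["Name", "AGE", "Gym_URL"]

# str.lower is passed as a function; map() calls it on every item
lowered = list(map(str.lower, columns))

print(lowered)  # ['name', 'age', 'gym_url']
```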

In [54]:
# convert col names to lower
df.columns = map(str.lower, df.columns)

df.head()

Unnamed: 0,click_page_details_name,click_page_details_record,click_page_details_current_streak,click_page_details_age,click_page_details_dob,click_page_details_class,click_page_details_height,click_page_details_reach,click_page_details_last_fight,click_page_details_gym,click_page_details_gym_url,click_page_details_opponent_name,click_page_details_opponent_result,click_page_details_opponent_result_url,click_page_details_opponent_fight_date
0,John Palaiologos,18-11-1 (Win-Loss-Draw),2 Losses,34.0,1987.08.19,Welterweight,"6'2"" (188cm)",,"December 04, 2021",EFL Martial Arts Academy,https://www.tapology.com/gyms/4110-efl-martial...,Christian Eckerlin,Loss · Rear Naked Choke · 2:55 · R1,https://www.tapology.com/fightcenter/bouts/615...,2021.12.04
1,John Palaiologos,18-11-1 (Win-Loss-Draw),2 Losses,34.0,1987.08.19,Welterweight,"6'2"" (188cm)",,"December 04, 2021",EFL Martial Arts Academy,https://www.tapology.com/gyms/4110-efl-martial...,Jesse Urholin,Loss · Ground & Pound · 3:36 · R2,https://www.tapology.com/fightcenter/bouts/600...,2021.10.01
2,John Palaiologos,18-11-1 (Win-Loss-Draw),2 Losses,34.0,1987.08.19,Welterweight,"6'2"" (188cm)",,"December 04, 2021",EFL Martial Arts Academy,https://www.tapology.com/gyms/4110-efl-martial...,Mukhamed Berkhamov,Cancelled Bout,https://www.tapology.com/fightcenter/bouts/521...,2020.09.26
3,John Palaiologos,18-11-1 (Win-Loss-Draw),2 Losses,34.0,1987.08.19,Welterweight,"6'2"" (188cm)",,"December 04, 2021",EFL Martial Arts Academy,https://www.tapology.com/gyms/4110-efl-martial...,Konstantin Linnik,Win · Punches · 0:48 · R1,https://www.tapology.com/fightcenter/bouts/472...,2019.12.15
4,John Palaiologos,18-11-1 (Win-Loss-Draw),2 Losses,34.0,1987.08.19,Welterweight,"6'2"" (188cm)",,"December 04, 2021",EFL Martial Arts Academy,https://www.tapology.com/gyms/4110-efl-martial...,Piotr Wawrzyniak,Win · Soccer Kick · 4:56 · R1,https://www.tapology.com/fightcenter/bouts/452...,2019.10.05


In [41]:
# check dtypes
df.dtypes

click_page_details_name                    object
click_page_details_record                  object
click_page_details_current_streak          object
click_page_details_age                    float64
click_page_details_dob                     object
click_page_details_class                   object
click_page_details_height                  object
click_page_details_reach                   object
click_page_details_last_fight              object
click_page_details_gym                     object
click_page_details_gym_url                 object
click_page_details_opponent_name           object
click_page_details_opponent_result         object
click_page_details_opponent_result_url     object
click_page_details_opponent_fight_date     object
dtype: object

In [42]:
# check number of unique values per col
df.nunique()

click_page_details_name                    25
click_page_details_record                  14
click_page_details_current_streak           9
click_page_details_age                      7
click_page_details_dob                      9
click_page_details_class                    3
click_page_details_height                   7
click_page_details_reach                    1
click_page_details_last_fight              10
click_page_details_gym                      5
click_page_details_gym_url                  5
click_page_details_opponent_name          147
click_page_details_opponent_result         89
click_page_details_opponent_result_url    128
click_page_details_opponent_fight_date    105
dtype: int64

__`df.where(cond, other=NoDefault.no_default, inplace=False, axis=None, level=None, errors='raise', try_cast=NoDefault.no_default)`__

[Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.where.html)

Where `cond` is `True`, keep the original value. Where `False`, replace with the corresponding value from `other`.
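
Because `where()` keeps the values for which the condition holds, it is easy to read it the wrong way around. The sketch below uses a small, made-up `Series` of ages (not the MMA data) to show which values survive:

```python
import numpy as np
import pandas as pd

# Hypothetical ages; 0 and -1 stand in for illogical values
ages = pd.Series([34.0, 0.0, 28.0, -1.0])

# Keep values where the condition is True, replace the rest with NaN
cleaned = ages.where(ages > 0, np.nan)

print(cleaned.tolist())  # [34.0, nan, 28.0, nan]
```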

In [63]:
import numpy as np

# rename cols
df.rename(columns=lambda s: s.replace("click_page_details_", ""), inplace=True)

df.head()

Unnamed: 0,name,record,current_streak,age,dob,class,height,reach,last_fight,gym,gym_url,opponent_name,opponent_result,opponent_result_url,opponent_fight_date
0,John Palaiologos,18-11-1 (Win-Loss-Draw),2 Losses,34.0,1987.08.19,Welterweight,"6'2"" (188cm)",,"December 04, 2021",EFL Martial Arts Academy,https://www.tapology.com/gyms/4110-efl-martial...,Christian Eckerlin,Loss · Rear Naked Choke · 2:55 · R1,https://www.tapology.com/fightcenter/bouts/615...,2021.12.04
1,John Palaiologos,18-11-1 (Win-Loss-Draw),2 Losses,34.0,1987.08.19,Welterweight,"6'2"" (188cm)",,"December 04, 2021",EFL Martial Arts Academy,https://www.tapology.com/gyms/4110-efl-martial...,Jesse Urholin,Loss · Ground & Pound · 3:36 · R2,https://www.tapology.com/fightcenter/bouts/600...,2021.10.01
2,John Palaiologos,18-11-1 (Win-Loss-Draw),2 Losses,34.0,1987.08.19,Welterweight,"6'2"" (188cm)",,"December 04, 2021",EFL Martial Arts Academy,https://www.tapology.com/gyms/4110-efl-martial...,Mukhamed Berkhamov,Cancelled Bout,https://www.tapology.com/fightcenter/bouts/521...,2020.09.26
3,John Palaiologos,18-11-1 (Win-Loss-Draw),2 Losses,34.0,1987.08.19,Welterweight,"6'2"" (188cm)",,"December 04, 2021",EFL Martial Arts Academy,https://www.tapology.com/gyms/4110-efl-martial...,Konstantin Linnik,Win · Punches · 0:48 · R1,https://www.tapology.com/fightcenter/bouts/472...,2019.12.15
4,John Palaiologos,18-11-1 (Win-Loss-Draw),2 Losses,34.0,1987.08.19,Welterweight,"6'2"" (188cm)",,"December 04, 2021",EFL Martial Arts Academy,https://www.tapology.com/gyms/4110-efl-martial...,Piotr Wawrzyniak,Win · Soccer Kick · 4:56 · R1,https://www.tapology.com/fightcenter/bouts/452...,2019.10.05


In [70]:
# .where() keeps ages greater than 0 and replaces all other values (0 or negative) with NaN
# assign the result back instead of using inplace=True on a selected column
df['age'] = df['age'].where(df['age'] > 0, np.nan)

df.head()

Unnamed: 0,name,record,current_streak,age,dob,class,height,reach,last_fight,gym,gym_url,opponent_name,opponent_result,opponent_result_url,opponent_fight_date
0,John Palaiologos,18-11-1 (Win-Loss-Draw),2 Losses,34.0,1987.08.19,Welterweight,"6'2"" (188cm)",,"December 04, 2021",EFL Martial Arts Academy,https://www.tapology.com/gyms/4110-efl-martial...,Christian Eckerlin,Loss · Rear Naked Choke · 2:55 · R1,https://www.tapology.com/fightcenter/bouts/615...,2021.12.04
1,John Palaiologos,18-11-1 (Win-Loss-Draw),2 Losses,34.0,1987.08.19,Welterweight,"6'2"" (188cm)",,"December 04, 2021",EFL Martial Arts Academy,https://www.tapology.com/gyms/4110-efl-martial...,Jesse Urholin,Loss · Ground & Pound · 3:36 · R2,https://www.tapology.com/fightcenter/bouts/600...,2021.10.01
2,John Palaiologos,18-11-1 (Win-Loss-Draw),2 Losses,34.0,1987.08.19,Welterweight,"6'2"" (188cm)",,"December 04, 2021",EFL Martial Arts Academy,https://www.tapology.com/gyms/4110-efl-martial...,Mukhamed Berkhamov,Cancelled Bout,https://www.tapology.com/fightcenter/bouts/521...,2020.09.26
3,John Palaiologos,18-11-1 (Win-Loss-Draw),2 Losses,34.0,1987.08.19,Welterweight,"6'2"" (188cm)",,"December 04, 2021",EFL Martial Arts Academy,https://www.tapology.com/gyms/4110-efl-martial...,Konstantin Linnik,Win · Punches · 0:48 · R1,https://www.tapology.com/fightcenter/bouts/472...,2019.12.15
4,John Palaiologos,18-11-1 (Win-Loss-Draw),2 Losses,34.0,1987.08.19,Welterweight,"6'2"" (188cm)",,"December 04, 2021",EFL Martial Arts Academy,https://www.tapology.com/gyms/4110-efl-martial...,Piotr Wawrzyniak,Win · Soccer Kick · 4:56 · R1,https://www.tapology.com/fightcenter/bouts/452...,2019.10.05


__Characterizing missingness with crosstab__

Let’s try to understand the missingness in the `age` column by counting the missing values for each fighter `name`. We will use the `crosstab()` function in pandas to do this.

`crosstab()` computes a __frequency table of two or more variables__. 

To look at the missingness in the `age` column, we call `isna()` on it. This returns a boolean for every row: `True` if the value is `NaN` and `False` if it is not. In our crosstab, we will look at all the names present in our data and how many of their rows have a missing age.


In [73]:
pd.crosstab(
 
        # tabulates the fighter names as the index
        df['name'],  
 
        # tabulates whether the age value is missing as columns
        df['age'].isna(), 
 
        # names the rows
        rownames = ['name'],
 
        # names the columns 
        colnames = ['age is na']) 

age is na,False,True
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Alexandros Alevras,0,1
Andreas Tricomitis,15,0
Aristokratis Papadopoulos,2,0
Bill Tranakidis,0,1
Evangelos Zafeiris,0,1
Georgios Daskalakis,0,4
Georgios Kaisidis,2,0
Georgios Kougioumtzidis,2,0
Giannis Bachar,0,10
Giannis Stavridis,0,2


__Removing prefixes__

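It might be easier to read the url links if we remove the prefixes of the websites, such as "https://www.". We will use `str.lstrip()` to __strip them from the left side__. 

Note that `lstrip()` treats its argument as a __set of characters__ rather than an exact prefix: it keeps removing leading characters for as long as they belong to that set. It works for the urls in this dataset because the character that follows each prefix is not in the set.

Similar to when we were working with our column names, we are working with strings, so we go through the `.str` accessor of the column and call `lstrip` to remove characters from the `left side`.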

In [75]:
df['gym_url'].head()

0    https://www.tapology.com/gyms/4110-efl-martial...
1    https://www.tapology.com/gyms/4110-efl-martial...
2    https://www.tapology.com/gyms/4110-efl-martial...
3    https://www.tapology.com/gyms/4110-efl-martial...
4    https://www.tapology.com/gyms/4110-efl-martial...
Name: gym_url, dtype: object

In [76]:
# .str.lstrip('https://') strips the leading characters h, t, p, s, : and /, which removes “https://” here
df['gym_url'] = df['gym_url'].str.lstrip('https://') 

df['gym_url'].head() 

0    www.tapology.com/gyms/4110-efl-martial-arts-ac...
1    www.tapology.com/gyms/4110-efl-martial-arts-ac...
2    www.tapology.com/gyms/4110-efl-martial-arts-ac...
3    www.tapology.com/gyms/4110-efl-martial-arts-ac...
4    www.tapology.com/gyms/4110-efl-martial-arts-ac...
Name: gym_url, dtype: object

In [77]:
# .str.lstrip('www.') strips the leading characters w and ., which removes “www.” here
df['gym_url'] = df['gym_url'].str.lstrip('www.')

df['gym_url'].head()

0    tapology.com/gyms/4110-efl-martial-arts-academy
1    tapology.com/gyms/4110-efl-martial-arts-academy
2    tapology.com/gyms/4110-efl-martial-arts-academy
3    tapology.com/gyms/4110-efl-martial-arts-academy
4    tapology.com/gyms/4110-efl-martial-arts-academy
Name: gym_url, dtype: object
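
One caveat about `lstrip()` is worth sketching, because the result above depends on the data. `lstrip()` strips a set of characters, not an exact prefix, so a url whose domain starts with one of those characters loses more than intended. The url below is hypothetical, and `str.removeprefix()` assumes Python 3.9 or newer:

```python
# Hypothetical URL whose domain starts with a character found in "https://"
url = "https://tapology.com/gyms"

# lstrip() strips any leading characters from the set {h, t, p, s, :, /}
stripped = url.lstrip("https://")

# removeprefix() (Python 3.9+) removes the exact prefix, and only once
prefix_removed = url.removeprefix("https://")

print(stripped)        # apology.com/gyms
print(prefix_removed)  # tapology.com/gyms
```

In pandas 1.4 and later, the same exact-prefix behaviour should be available on a column through `Series.str.removeprefix()`.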

Amazing! Our __dataset is now much easier to read and use__. 

We have __identifiable columns and variables__ that are easy to work with thanks to our data wrangling process. 

We also __corrected illogical data values__ and __made the strings a little easier to read__.

In this example, we worked with data that was rather __tidy__, in the sense that __each row was an observation__ and __each column was a variable__. 

However, what if our dataset was not tidy? What if our columns and rows needed reorganization?

__Tidy Data__

Let’s take a look at a small toy dataset with the average annual wage for three areas in the years 2000 and 2007. It mirrors the structure of the course example (average annual wage for restaurant workers across New York City boroughs, from the New York State Department of Labor, Quarterly Census of Employment and Wages), but uses placeholder area names and round figures, and contains only three rows.

In [86]:
df_toy = pd.DataFrame(
    {'borough': ['area_A', 'area_B', 'area_C'],
     '2000': [20_000, 25_000, 22_000],
     '2007': [18_000, 20_000, 25_000],
    }
)

df_toy

Unnamed: 0,borough,2000,2007
0,area_A,20000,18000
1,area_B,25000,20000
2,area_C,22000,25000


There are three variables in this dataset: `borough`, `year`, and `average annual wage`. 

However, we have values (`2000` and `2007`) in the column headers rather than variable names (`year` and `average annual wage`). This is not ideal to work with, so let’s fix it! 

We will use the `melt()` function in pandas to turn the current values (`2000` and `2007`) in the column headers into row values and add `year` and `avg_annual_wage` as our column labels.

In [88]:
df_toy_fixed=df_toy.melt(
 
      # which column to use as identifier variables
      id_vars=["borough"], 
 
      # column name to use for “variable” names/column headers (i.e. 2000 and 2007) 
      var_name="year", 
 
      # column name for the values originally in the columns 2000 and 2007
      value_name="avg_annual_wage") 

df_toy_fixed

Unnamed: 0,borough,year,avg_annual_wage
0,area_A,2000,20000
1,area_B,2000,25000
2,area_C,2000,22000
3,area_A,2007,18000
4,area_B,2007,20000
5,area_C,2007,25000


Now we have a tidy dataset where each column is a variable (borough, year, or average annual wage), and each row is an observation! This dataset will be much easier to work with moving forward!

<a name="regexes"></a>
# 5.3 Regular Expressions
1. [Intro](#regexintro)
1. [Literals](#literals)
1. [Alternation](#alternation)
1. [Character Sets](#charsets)
1. [Wild for Wildcards](#wildcards)
1. [Ranges](#ranges)
1. [Shorthand Character Classes](#classes)
1. [Grouping](#grouping)
1. [Quantifiers - Fixed](#fixedquant)
1. [Quantifiers - Optional](#optionalquant)
1. [Quantifiers - 0 or More, 1 or More](#starplus)
1. [Anchors](#anchors)

<a name="regexintro"></a>
## 5.3.1 Intro

The technology that __fuels the verification systems__ on nearly every website and application is the ever-reliable, often quirky language of __regular expressions__, commonly shortened to __regex__.

A regular expression is a __special sequence of characters__ that describes a __pattern of text__ that should be found, or matched, in a string or document.

In [26]:
import re

text = """
A regular expression (shortened as regex or regexp;[1] also referred to as rational expression[2][3]) is a 
sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching 
algorithms for "find" or "find and replace" operations on strings, or for input validation.

It is a technique developed in theoretical computer science and formal language theory.

The concept of regular expressions began in the 1950s, when the American mathematician Stephen Cole Kleene formalized 
the description of a regular language. They came into common use with Unix text-processing utilities. Different syntaxes
for writing regular expressions have existed since the 1980s, one being the POSIX standard and another, widely used,
being the Perl syntax.

Regular expressions are used in search engines, search and replace dialogs of word processors and text editors,
in text processing utilities such as sed and AWK and in lexical analysis. Many programming languages provide regex
capabilities either built-in or via libraries, as it has uses in many situations.
"""
text

'\nA regular expression (shortened as regex or regexp;[1] also referred to as rational expression[2][3]) is a \nsequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching \nalgorithms for "find" or "find and replace" operations on strings, or for input validation.\n\nIt is a technique developed in theoretical computer science and formal language theory.\n\nThe concept of regular expressions began in the 1950s, when the American mathematician Stephen Cole Kleene formalized \nthe description of a regular language. They came into common use with Unix text-processing utilities. Different syntaxes\nfor writing regular expressions have existed since the 1980s, one being the POSIX standard and another, widely used,\nbeing the Perl syntax.\n\nRegular expressions are used in search engines, search and replace dialogs of word processors and text editors,\nin text processing utilities such as sed and AWK and in lexical analysis. Many program

<a name="literals"></a>
## 5.3.2 Literals

The simplest text we can match with regular expressions are __literals__; this is where our regular expression contains the __exact text__ that we want to match.

In [39]:
# define text
text = "I love baboons and I love gorillas"
# create regex pattern using literals
pattern = re.compile(r'love')
# find all matches
result = pattern.findall(text)
# show result
result

['love', 'love']
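
Two properties of literals are easy to overlook: they match anywhere in the text, even inside a longer word, and they are case-sensitive by default. A short sketch with a made-up sentence:

```python
import re

text = "The monkeys like to monkey around"

# A literal matches wherever it occurs, including inside a longer word
inside_words = re.findall(r"monkey", text)

# Matching is case-sensitive by default, so "Monkey" finds nothing here
wrong_case = re.findall(r"Monkey", text)

print(inside_words)  # ['monkey', 'monkey']
print(wrong_case)    # []
```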

<a name="alternation"></a>
## 5.3.3 Alternation

You can __find two distinct phrases__ with the same regular expression using __alternation__. 

It is performed with the __pipe symbol__, `|`, and allows us to match either the characters preceding the `|` __OR__ the characters after the `|`.

In [36]:
# define text
text = "I love baboons and I love gorillas"
# create regex pattern using alternation
pattern = re.compile(r'I love baboons|I love gorillas')
# find all matches
result = pattern.findall(text)
# show result
result

['I love baboons', 'I love gorillas']
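
One detail that is easy to miss: whitespace next to the `|` is part of the alternatives, so it has to appear in the text for a match. A minimal sketch with a made-up string:

```python
import re

text = "cat dog"

# Spaces around | belong to the alternatives: this means "cat " OR " dog"
with_spaces = re.findall(r"cat | dog", text)

# Without the spaces the alternatives are exactly "cat" OR "dog"
without_spaces = re.findall(r"cat|dog", text)

print(with_spaces)     # ['cat ']
print(without_spaces)  # ['cat', 'dog']
```

In the first pattern the space after `cat` is consumed by the first match, so ` dog` (with a leading space) can no longer be found.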

<a name="charsets"></a>
## 5.3.4 Character Sets

Character sets, denoted by a pair of __square brackets__, `[]`, let us match one character from a series of characters, allowing for matches with incorrect or different spellings.

In [40]:
# define text
text = "consencus and concencus and concensus and concencus"
# create regex pattern using char sets
pattern = re.compile(r'con[sc]en[cs]us')
# find all matches
result = pattern.findall(text)
# show result
result

['consencus', 'concencus', 'concensus', 'concencus']

We can make our character sets even more powerful with the help of the __caret symbol__, `^`. Placed __at the front of a character set__, it __negates the set__, matching any character that is not stated. These are called negated character sets.

In [42]:
# define text
text = "consencus and concencus and concensus and concencus"
# create regex pattern using negated char sets 
pattern = re.compile(r'con[^s]en[cs]us')
# find all matches
result = pattern.findall(text)
# show result
result

['concencus', 'concensus', 'concencus']
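
The position of the caret matters. The sketch below, using made-up strings, contrasts a caret at the front of the set (negation) with a caret elsewhere in the set (a literal `^`):

```python
import re

# A caret at the front negates the set: match anything except vowels
non_vowels = re.findall(r"[^aeiou]", "regex")

# A caret anywhere else in the set is just a literal "^"
literal_caret = re.findall(r"[a^]", "a^b")

print(non_vowels)     # ['r', 'g', 'x']
print(literal_caret)  # ['a', '^']
```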

<a name="wildcards"></a>
## 5.3.5 Wild for Wildcards

Wildcards, written as a period `.`, will **match any single character** (letter, number, symbol or whitespace) in a piece of text, with the exception of a newline by default. 

They are useful when we do not care about the specific value of a character, but only that a character exists.

In [44]:
# define text
text = "consencus and concencus and concensus and concencus"
# create regex pattern using wildcard
pattern = re.compile(r'con.en.us')
# find all matches
result = pattern.findall(text)
# show result
result

['consencus', 'concencus', 'concensus', 'concencus']

We can use the __escape character__, `\`, to escape the functionality of a __metacharacter__ and match the literal character.

In [47]:
# define text
text = "consencus and concencus and concensus and concencus."
# create regex pattern using escape char
pattern = re.compile(r'con[sc]en[sc]us\.')
# find all matches
result = pattern.findall(text)
# show result
result

['concencus.']
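
To see the difference between the wildcard and an escaped period side by side, here is a small sketch on a made-up string. It also shows that, by default, the wildcard does not match a newline:

```python
import re

text = "3.14 and 3x14"

# Unescaped, the period is a wildcard and matches any character
wildcard = re.findall(r"3.14", text)

# Escaped, it matches only a literal period
escaped = re.findall(r"3\.14", text)

# By default the wildcard does not match a newline
across_newline = re.findall(r"a.b", "a\nb")

print(wildcard)        # ['3.14', '3x14']
print(escaped)         # ['3.14']
print(across_newline)  # []
```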

<a name="ranges"></a>
## 5.3.6 Ranges

Ranges allow us to __specify a range of characters__ in which we can make a match without having to type out each individual character. 

In [50]:
# define text
text = "consencus and concencus and concensus and concencus."
# create regex pattern using range
pattern = re.compile(r'[a-c]')
# find all matches
result = pattern.findall(text)
# show result
result

['c', 'c', 'a', 'c', 'c', 'c', 'a', 'c', 'c', 'a', 'c', 'c', 'c']
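
Ranges also work for digits and uppercase letters, and several ranges can share one character set. A brief sketch on a made-up string:

```python
import re

text = "Room 42B"

# A digit range
digits = re.findall(r"[0-9]", text)

# Two ranges in one set: any uppercase or lowercase letter
letters = re.findall(r"[A-Za-z]", text)

print(digits)   # ['4', '2']
print(letters)  # ['R', 'o', 'o', 'm', 'B']
```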

<a name="classes"></a>
## 5.3.7 Shorthand Character Classes

There are shorthand character classes that represent __common ranges__:

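__`\w`__: the “word character” class represents the regex range `[A-Za-z0-9_]`, and it matches a single uppercase character, lowercase character, digit or underscore (for `str` patterns in Python 3 it also matches Unicode letters and digits unless the `re.ASCII` flag is set)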

__`\d`__: the “digit character” class represents the regex range `[0-9]`, and it matches a single digit character

__`\s`__: the “whitespace character” class represents the regex range `[ \t\r\n\f\v]`, matching a single space, tab, carriage return, line break, form feed, or vertical tab

In addition to the shorthand character classes, we also have access to the __negated shorthand character classes__ that will match any character that is NOT in the regular shorthand classes:

__`\W`__: the “non-word character” class represents the regex range `[^A-Za-z0-9_]`, matching any character that is not included in the range represented by \w

__`\D`__: the “non-digit character” class represents the regex range `[^0-9]`, matching any character that is not included in the range represented by \d

__`\S`__: the “non-whitespace character” class represents the regex range `[^ \t\r\n\f\v]`, matching any character that is not included in the range represented by \s

In [53]:
# define text
text = "I love baboons and I love gorillas"
# create regex pattern using shorthands
pattern = re.compile(r'\w\w\w\w')
# find all matches
result = pattern.findall(text)
# show result
result

['love', 'babo', 'love', 'gori', 'llas']

In [58]:
# define text
text = "I love baboons and I love gorillas"
# create regex pattern using negated shorthands
pattern = re.compile(r'\S')
# find all matches
result = pattern.findall(text)
# show result
print(result)

['I', 'l', 'o', 'v', 'e', 'b', 'a', 'b', 'o', 'o', 'n', 's', 'a', 'n', 'd', 'I', 'l', 'o', 'v', 'e', 'g', 'o', 'r', 'i', 'l', 'l', 'a', 's']
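
The remaining classes can be sketched the same way. The example below uses `\d` to pick out a simple number format and `\W` to find non-word characters; both strings are made up for illustration:

```python
import re

# \d matches one digit, so this pattern describes the shape ddd-dddd
phone_numbers = re.findall(r"\d\d\d-\d\d\d\d", "Call 555-1234 or 555-9876")

# \W matches anything that is not a letter, digit or underscore
non_word = re.findall(r"\W", "a_b c!")

print(phone_numbers)  # ['555-1234', '555-9876']
print(non_word)       # [' ', '!']
```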


<a name="grouping"></a>
## 5.3.8 Grouping

The `|` symbol matches the entire expression before or after itself, but __grouping__, `()`, lets us group parts of a regular expression together, and allows us to __limit alternation__ to part of the regex.

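These groups are also called __capture groups__, as they have the power to select, or __capture__, a substring from our matched text. When a pattern contains a single capture group, `findall()` returns only the captured substrings rather than the whole match.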

In [64]:
# define text
text = "I love baboons and I love gorillas"
# create regex pattern using grouping
pattern = re.compile(r'I love (baboons|gorillas)')
# find all matches
result = pattern.findall(text)
# show result
print(result)

['baboons', 'gorillas']


`(?:...)` is a __non-capturing version of regular parentheses__. 

It matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

In [65]:
# define text
text = "I love baboons and I love gorillas"
# create regex pattern using non-capturing grouping
pattern = re.compile(r'I love (?:baboons|gorillas)')
# find all matches
result = pattern.findall(text)
# show result
print(result)

['I love baboons', 'I love gorillas']
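
With a capture group, both the whole match and the captured part remain available on a match object. A minimal sketch using `re.search()`, which returns the first match only:

```python
import re

text = "I love baboons and I love gorillas"

# search() returns a match object for the first match
match = re.search(r"I love (baboons|gorillas)", text)

# group(0) is the whole match, group(1) is the first capture group
whole = match.group(0)
captured = match.group(1)

print(whole)     # I love baboons
print(captured)  # baboons
```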


<a name="fixedquant"></a>
## 5.3.9 Quantifiers - Fixed

So far we have only matched text on a **character by character basis**.

__Fixed quantifiers__, `{}`, let us indicate the __exact quantity of a character__ we wish to match, or allow us to provide a __quantity range__ to match on.

An important note is that quantifiers are considered to be __greedy__. This means that __they will match the greatest quantity of characters they possibly can__. 

In [66]:
# define text
text = "I love baboons and I love gorillas"
# create regex pattern using fixed quantifiers
pattern = re.compile(r'\w{4}')
# find all matches
result = pattern.findall(text)
# show result
print(result)

['love', 'babo', 'love', 'gori', 'llas']


In [68]:
# define text
text = "I love baboons and I love gorillas"
# create regex pattern using fixed range quantifiers
pattern = re.compile(r'\w{1,4}')
# find all matches
result = pattern.findall(text)
# show result
print(result)

['I', 'love', 'babo', 'ons', 'and', 'I', 'love', 'gori', 'llas']
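
Greediness is easiest to see on a string that offers more characters than the minimum. In this sketch on a made-up string of digits, `{2,4}` takes four digits even though two would be enough:

```python
import re

# Greedy: the quantifier takes as many digits as the range allows
first_match = re.search(r"\d{2,4}", "123456").group()

# findall() continues after the first match with what is left
all_chunks = re.findall(r"\d{2,4}", "123456")

print(first_match)  # 1234
print(all_chunks)   # ['1234', '56']
```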


<a name="optionalquant"></a>
## 5.3.10 Quantifiers - Optional

Optional quantifiers, `?`, allow us to __indicate a character in a regex is optional__, or can appear either 0 times or 1 time. 

Note that the `?` only __applies to the character (or group) directly before it__.

In [71]:
# define text
text = "I love baboons and I love gorillas."
# create regex pattern using optional quantifiers
pattern = re.compile(r'\w{3,}\.?')
# find all matches
result = pattern.findall(text)
# show result
print(result)

['love', 'baboons', 'and', 'love', 'gorillas.']
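
A classic use of `?` is to accept spelling variants. The sketch below, with made-up strings, makes a single character optional and then a whole non-capturing group optional:

```python
import re

# The ? makes only the "u" optional
spellings = re.findall(r"humou?r", "humor and humour")

# Wrapped in a group, the ? makes the whole "un" optional
optional_group = re.findall(r"(?:un)?happy", "happy and unhappy")

print(spellings)       # ['humor', 'humour']
print(optional_group)  # ['happy', 'unhappy']
```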


<a name="starplus"></a>
## 5.3.11 Quantifiers - 0 or More, 1 or More

The __Kleene star__, `*`, is also a __quantifier__, and __matches the preceding character 0 or more times__.

In [77]:
# define text
text = "The cat mews, and meoows, and meooows, and meoooooooooooows."
# create regex pattern using the Kleene star
pattern = re.compile(r'meo*ws\.?')
# find all matches
result = pattern.findall(text)
# show result
print(result)

['mews', 'meoows', 'meooows', 'meoooooooooooows.']


The __Kleene plus__, `+`, __matches the preceding character 1 or more times__.

In [78]:
# define text
text = "The cat mews, and meoows, and meooows, and meoooooooooooows."
# create regex pattern using the Kleene plus
pattern = re.compile(r'meo+ws\.?')
# find all matches
result = pattern.findall(text)
# show result
print(result)

['meoows', 'meooows', 'meoooooooooooows.']
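
The only difference between `*` and `+` is whether zero occurrences are accepted. A small sketch with made-up strings, using `re.fullmatch()` to test the whole string:

```python
import re

# "a" followed by zero b's: accepted by *, rejected by +
star_accepts_zero = re.fullmatch(r"ab*", "a") is not None
plus_accepts_zero = re.fullmatch(r"ab+", "a") is not None

# + is a common way to grab runs of digits of any length
numbers = re.findall(r"\d+", "a1b22c333")

print(star_accepts_zero)  # True
print(plus_accepts_zero)  # False
print(numbers)            # ['1', '22', '333']
```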


<a name="anchors"></a>
## 5.3.12 Anchors

The __hat anchor__, `^`, is used to __match text at the start of a string__.

In [85]:
# define text
text = "the cat mews, and the cat meoows, and meooows, and meoooooooooooows."
# create regex pattern using the hat anchor
pattern = re.compile(r'^the')
# find all matches
result = pattern.findall(text)
# show result
print(result)

['the']


The __dollar sign anchor__, `$`, is used to __match text at the end of a string__.

In [86]:
# define text
text = "The cat mews, and meoows, and meooows, and meoooooooooooows"
# create regex pattern using the dollar sign anchor
pattern = re.compile(r'ows$')
# find all matches
result = pattern.findall(text)
# show result
print(result)

['ows']
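
Used together, the two anchors force a pattern to describe the whole string, which is the basis of simple input validation. The function below is a hypothetical helper, sketched only to illustrate the idea:

```python
import re

def is_five_digits(value):
    """Check that the string consists of exactly five digits.

    Note: $ also matches just before a single trailing newline.
    """
    return re.search(r"^\d{5}$", value) is not None

print(is_five_digits("12345"))      # True
print(is_five_digits("123456"))     # False
print(is_five_digits("zip 12345"))  # False
```

Without the anchors, `\d{5}` would also be found inside the longer strings.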


Official Python `re` module [documentation](https://docs.python.org/3/library/re.html#module-contents).

Useful regex sites for practice: [RegExr](https://regexr.com/) & [regex101](https://regex101.com/).