# Import modules

In [1]:
import pandas as pd

### Load CSV data into the DataFrame variable df

In [2]:
df = pd.read_csv("../../static/documents/regex_pandax_column.csv")

### Inspect Dataframe Dimensions

In [3]:
df.shape

(42, 700)

`df.shape` returns `(rows, columns)`, so there are 42 rows and 700 columns.

In [4]:
print(f'The number of columns in this dataset is: {len(df.columns)}')

The number of columns in this dataset is: 700


### Inspect Dataframe

In [5]:
df.head(3)

Unnamed: 0,FORM[0].Section1[0].Header1[0].FirstName[0],FORM[0].Section1[0].Header1[0].LastName[0],FORM[0].Section1[0].Header1[0].Datecompleted[0],FORM[0].Section1[0].Header1[0].Rank[0],FORM[0].Section1[0].Header1[0].Tenure[0],FORM[0].Section1[0].Header1[0].Timeperiodstartdate[0],FORM[0].Section1[0].Header1[0].AcademicActivities[0],FORM[0].Section1[0].Header1[0].ClinicalPractice[0],FORM[0].Section1[0].Header1[0].Scholarly[0],FORM[0].Section1[0].Header1[0].ExternalProfessional[0],...,FORM[0].SciAbstracts[0].Row1[7].SciAbstractAdditionalroles[0],FORM[0].SciAbstracts[0].Row1[7].SciAbstractTypeofmeeting[0],FORM[0].SciAbstracts[0].Row1[7].SciAbstractPodPoster[0],FORM[0].SciAbstracts[0].Row1[8].SpeakerAuthor[0],FORM[0].SciAbstracts[0].Row1[8].SpeakerDVMStudent[0],FORM[0].SciAbstracts[0].Row1[8].SpeakerTitle[0],FORM[0].SciAbstracts[0].Row1[8].SciAbstractDates[0],FORM[0].SciAbstracts[0].Row1[8].SciAbstractAdditionalroles[0],FORM[0].SciAbstracts[0].Row1[8].SciAbstractTypeofmeeting[0],FORM[0].SciAbstracts[0].Row1[8].SciAbstractPodPoster[0]
0,SHHM,XSRQ,BLHB,CRWY,VQHG,AGYD,RCQC,GLQU,WDFV,JPYA,...,BANG,PIMY,EFUO,XCAB,RNXB,XDQH,MXQL,QDSJ,NWRK,AJKI
1,CYJX,WJKW,YGEV,EHGG,GBMR,BQQH,UQVS,QNGY,DRCQ,DJHS,...,CLFT,SMMD,LJNO,SOMG,DXUL,DEHL,WRQH,BHFC,FWGU,AQGM
2,PEOA,TSSF,XXRG,SKRG,LSID,FHKU,YQCQ,TECF,YPJV,QARI,...,UWWH,YFEM,BLTK,GWMV,PMNX,RDUA,BEGG,JHWJ,BOXX,GSQB


The column headers are difficult to read because each one includes metadata about the form field the data was exported from.

### Inspect a subset of the Columns

In [6]:
df.columns[0:10]

Index(['FORM[0].Section1[0].Header1[0].FirstName[0]',
       'FORM[0].Section1[0].Header1[0].LastName[0]',
       'FORM[0].Section1[0].Header1[0].Datecompleted[0]',
       'FORM[0].Section1[0].Header1[0].Rank[0]',
       'FORM[0].Section1[0].Header1[0].Tenure[0]',
       'FORM[0].Section1[0].Header1[0].Timeperiodstartdate[0]',
       'FORM[0].Section1[0].Header1[0].AcademicActivities[0]',
       'FORM[0].Section1[0].Header1[0].ClinicalPractice[0]',
       'FORM[0].Section1[0].Header1[0].Scholarly[0]',
       'FORM[0].Section1[0].Header1[0].ExternalProfessional[0]'],
      dtype='object')

The column headers in data exported from an Adobe form carry extra information about the document itself, and the actual column names are embedded deep inside these strings.<br><br>
FORM[0].Section1[0].Header1[0].<font color ="red">FirstName</font>[0]

# Goal: Use pandas and Regular Expression to 

### Extract Column Name using <font color="red">pandas.Series.str.extract</font>

 - This method will capture groups in the regex pattern as columns in a DataFrame.
 - <code>Series.str.extract(self, pat, flags=0, expand=True)</code>
 - Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html
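
As a minimal sketch of how capture groups become columns, the example below runs `str.extract` on a small, made-up Series. The sample strings and the `parts` name are illustrative assumptions, not part of the dataset in this post. With the default `expand=True`, the result is a DataFrame with one column per capture group.

```python
import pandas as pd

# Hypothetical sample data, not taken from the CSV used in this post
s = pd.Series(["a1", "b2", "c3"])

# Two capture groups: one lowercase letter, then one digit
parts = s.str.extract(r"([a-z])(\d)")

print(type(parts).__name__)  # DataFrame
print(parts.shape)           # (3, 2)
print(parts[0].tolist())     # ['a', 'b', 'c']
```

Unnamed groups are labelled by position (0, 1, ...), which is why the first group is read back with `parts[0]`.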


In [7]:
pattern = r'((?<=\.)[a-zA-Z0-9]+(?=\[0\]$))'

### Pattern explained

<b>Code: </b><code>r''</code><br>
<em>What is this?</em> <br>
This is a raw string literal. The <code>r</code> prefix tells Python to keep backslashes inside the quotes as literal characters instead of treating them as escape sequences, so the pattern reaches the <code>re</code> module unchanged.<br>
<br>
<b>Code: </b><code>(?<=\\.)</code><br>
<em>What is this?</em>
<br>
This is a <font color="blue">positive lookbehind assertion</font>, which helps set the starting point for the pattern you want to match.
<br>
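In this example, we want the match to start immediately after the red <code>.</code> that comes before the actual column name we want: "FirstName". The lookbehind checks that the <code>.</code> is there without including it in the match.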
<br>
FORM[0].Section1[0].Header1[0]<font color ="red"><b>.</b></font>FirstName[0]
<br>
Note that there are two other <font color ="red"><b>.</b></font>'s in this string, so we need a way to select only the last one as the starting point.<br> 
<br>
<b>Code: </b><code>(?=\\[0\\]&dollar;)</code><br>
<em>What is this?</em><br>
    
This is a <font color="blue">positive lookahead assertion</font>: the match must be immediately followed by the pattern <code>[0]</code>, which is checked but not included in the match.<br>
The backslashes in <code>\\[0\\]</code> tell the re module that the brackets are literal characters in the pattern. Without the backslashes, the re module would interpret the brackets as a character class.<br>
<br> 
The <code>&dollar;</code> character tells the re module to only look for this pattern at the end of the string.<br>
Taken together, <code>(?=\\[0\\]&dollar;)</code> tells the re module to accept a match only when it is followed by <code>[0]</code> at the very end of the string.<br>
This ensures we get only the column name, which is the last component of the header.<br>
<br>
<b>Code: </b><code>\[a-zA-Z0-9\]+</code><br>
<em>What is this?</em><br>
<code>a-zA-Z</code> will match any alphabetic character.<br>
<code>0-9</code> will match any digit between 0-9.<br>
<code>+</code> requires one or more of these characters in a row.<br>
Taken together, this matches any run of letters and digits that sits between the positive lookbehind assertion and the lookahead assertion.
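
The pieces explained above can be checked one at a time with the standard-library `re` module. This is a small sketch using the example header from this post; the variable names (`header`, `after_dot`, and so on) are illustrative.

```python
import re

pattern = r'((?<=\.)[a-zA-Z0-9]+(?=\[0\]$))'
header = 'FORM[0].Section1[0].Header1[0].FirstName[0]'

# Raw string: the backslash stays a literal character
print(len(r'\n'), len('\n'))  # 2 1

# Lookbehind only: every run of letters/digits that follows a dot
after_dot = re.findall(r'(?<=\.)[a-zA-Z0-9]+', header)
print(after_dot)  # ['Section1', 'Header1', 'FirstName']

# Adding the lookahead without $ does not narrow anything here,
# because every name in this header is followed by [0]
no_anchor = re.findall(r'(?<=\.)[a-zA-Z0-9]+(?=\[0\])', header)
print(no_anchor)  # ['Section1', 'Header1', 'FirstName']

# Anchoring the lookahead with $ keeps only the final name
with_anchor = re.findall(r'(?<=\.)[a-zA-Z0-9]+(?=\[0\]$)', header)
print(with_anchor)  # ['FirstName']

# The full pattern from this post, with its capture group
print(re.search(pattern, header).group(1))  # FirstName
```

The comparison between `no_anchor` and `with_anchor` shows that the `$` inside the lookahead is what selects the last dot as the starting point.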

### Infographic Summary

<img src="static/images/pandas_regex_blog_explained.png"></img>

### Replace all column headers with extracted substring

In [8]:
# Replace every column header (the CSV header strings) with the extracted substring
df.columns = df.columns.str.extract(pattern)

In [9]:
df.columns[0:5]

Index([('FirstName',), ('LastName',), ('Datecompleted',), ('Rank',),
       ('Tenure',)],
      dtype='object')

### The returned data is a tuple!

In [10]:
print(f'The returned datatype is a Tuple: {isinstance(df.columns[0], tuple)}!')

The returned datatype is a Tuple: True!
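
One way to avoid the tuples in the first place, offered as an alternative sketch rather than the approach this post follows, is to pass `expand=False`. With a single capture group, `str.extract` then returns an Index of strings instead of a DataFrame. The `demo` frame is a hypothetical two-column stand-in for the real data.

```python
import pandas as pd

pattern = r'((?<=\.)[a-zA-Z0-9]+(?=\[0\]$))'

# Hypothetical two-column frame with headers in the same style as the CSV
demo = pd.DataFrame(
    [["SHHM", "XSRQ"]],
    columns=[
        "FORM[0].Section1[0].Header1[0].FirstName[0]",
        "FORM[0].Section1[0].Header1[0].LastName[0]",
    ],
)

# expand=False with one capture group yields an Index of plain strings
demo.columns = demo.columns.str.extract(pattern, expand=False)

print(list(demo.columns))                 # ['FirstName', 'LastName']
print(isinstance(demo.columns[0], str))   # True
```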


### Use a list comprehension to replace tuple header columns with a single substring

### Step 1: Prepare the for loop

<strong><font color="purple">new_columns</font></strong> = []<br>
<font color="blue">for</font> <strong><font color="orange">x, y</font></strong> in <font color="red">enumerate(df.columns)</font>:
<p style="margin-left:10px; margin-right:50px;">
    <font color="purple">new_columns</font>.append(<strong><font color="991B05">df.columns[<font color="orange">x</font>][0]</font></strong>)<br></p>
<font color="purple">df.columns</font> = <font color="purple">new_columns</font><br>
<hr style="border: 1px solid black;">

### Step 2: Rearrange into a list comprehension

<strong><font color="purple">new list</font></strong> = [<strong><font color="991B05">expression</font></strong> <font color="blue">for</font> <strong><font color="orange">item</font></strong> in <font color="red">old sequence</font>]<br>

<strong><font color="purple">df.columns</font></strong> = [<strong><font color="991B05">df.columns[<font color="orange">x</font>][0]</font></strong> <font color="blue">for</font> <strong><font color="orange">x, y</font></strong> in <font color="red">enumerate(df.columns)</font>]

<hr style="border: 1px solid black;">

In [11]:
df.columns = [df.columns[x][0] for x, y in enumerate(df.columns)]
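
Because the index from `enumerate` is only used to look each tuple back up, the same result can be had by iterating over the labels directly. The sketch below uses a plain list of tuples (a hypothetical stand-in for `df.columns`) to show that both forms agree.

```python
# Hypothetical stand-in for the tuple labels shown above
columns = [('FirstName',), ('LastName',), ('Datecompleted',)]

# Form used in this post: index back into the sequence
via_enumerate = [columns[x][0] for x, y in enumerate(columns)]

# Shorter form: take the first element of each tuple directly
direct = [col[0] for col in columns]

print(direct)                   # ['FirstName', 'LastName', 'Datecompleted']
print(direct == via_enumerate)  # True
```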

### Confirm tuples replaced with single column substring

In [12]:
df.columns[0:5]

Index(['FirstName', 'LastName', 'Datecompleted', 'Rank', 'Tenure'], dtype='object')

In [13]:
print(f'The returned datatype is a Tuple: {isinstance(df.columns[0], tuple)}!')

The returned datatype is a Tuple: False!
