## Regular expressions

- Special strings used to match specific pieces of text
- Use special characters to find patterns in text

**Example uses**
- Find words of a particular length
- Find sentences with a particular structure
- Find words that meet particular criteria

In Python, regular expressions can be built with the `re` package (part of Python's standard library).

Regular expressions can be used in, for example, pandas to find specific pieces of text (rather than just searching for words).

### Example

In [30]:
cppasta = "Did you ever hear the tragedy of Darth Plagueis The Wise? I thought not. It’s not a story the Jedi would tell you. It’s a Sith legend. Darth Plagueis was a Dark Lord of the Sith, so powerful and so wise he could use the Force to influence the midichlorians to create life… He had such a knowledge of the dark side that he could even keep the ones he cared about from dying. The dark side of the Force is a pathway to many abilities some consider to be unnatural. He became so powerful… the only thing he was afraid of was losing his power, which eventually, of course, he did. Unfortunately, he taught his apprentice everything he knew, then his apprentice killed him in his sleep. Ironic. He could save others from death, but not himself."
print(cppasta)

Did you ever hear the tragedy of Darth Plagueis The Wise? I thought not. It’s not a story the Jedi would tell you. It’s a Sith legend. Darth Plagueis was a Dark Lord of the Sith, so powerful and so wise he could use the Force to influence the midichlorians to create life… He had such a knowledge of the dark side that he could even keep the ones he cared about from dying. The dark side of the Force is a pathway to many abilities some consider to be unnatural. He became so powerful… the only thing he was afraid of was losing his power, which eventually, of course, he did. Unfortunately, he taught his apprentice everything he knew, then his apprentice killed him in his sleep. Ironic. He could save others from death, but not himself.


In [31]:
import re

regex = re.compile(r"\bs\w{3,12}", re.IGNORECASE)
regex.findall(cppasta)

['story', 'Sith', 'Sith', 'such', 'side', 'side', 'some', 'sleep', 'save']

### Matching types of characters

|Character|Description|
|--|--|
|\w|Any word character|
|\W|Non-word character (like a space or newline)|
|\d|Digit|
|\s|Whitespace|
|\S|Non-whitespace|
|\n|Newline|
|\b|Word boundary|
|.|Any character except a newline (by default)|
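
The character classes in the table can be tried out directly with `re.findall`. The snippet below is a minimal sketch; the sample string is made up for illustration and is not part of the course data.

```python
import re

# Hypothetical sample text, chosen only to exercise each character class
text = "Order 66 at 9pm\ncat concat cut"

digits = re.findall(r"\d", text)        # single digits
whitespace = re.findall(r"\s", text)    # spaces and the newline
non_word = re.findall(r"\W", "a-b c")   # everything that is not a word character

# "." matches any single character (except a newline)
anywhere = re.findall(r"c.t", text)

# "\b" additionally requires the match to start at a word boundary
word_start = re.findall(r"\bc.t", text)

print(digits)      # ['6', '6', '9']
print(whitespace)  # [' ', ' ', ' ', '\n', ' ', ' ']
print(non_word)    # ['-', ' ']
print(anywhere)    # ['cat', 'cat', 'cut']
print(word_start)  # ['cat', 'cut']
```

Note how `c.t` also finds the `cat` inside `concat`, while `\bc.t` only finds matches that begin a word.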

### Specific patterns

|Character|Description|
|--|--|
|^|Match beginning of string|
|$|Match end of string|
|\||Either or|
|?|Match zero or one time|
|+|Match one or more times|
|*|Match zero or more times|
|{x,y}|Match between x and y times|
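
As a rough illustration of the anchors and quantifiers in the table, the following sketch applies each of them to short, invented strings (none of them come from the tweet data).

```python
import re

# ^ and $ anchor the match to the start and end of the string
starts_with = bool(re.search(r"^Darth", "Darth Plagueis"))
ends_with = bool(re.search(r"Wise$", "Plagueis The Wise"))

# | matches either alternative
either = re.findall(r"Jedi|Sith", "The Jedi and the Sith")

# ? makes the preceding character optional (zero or one time)
optional = re.findall(r"colou?r", "color colour")

# + requires one or more, * allows zero or more
one_or_more = re.findall(r"\d+", "a1 b22 c333")
zero_or_more = re.findall(r"ab*", "a ab abbb")

# {x,y} bounds the number of repetitions: whole words of 3 to 4 characters
bounded = re.findall(r"\b\w{3,4}\b", "so wise he could use the Force")

print(starts_with, ends_with)  # True True
print(either)                  # ['Jedi', 'Sith']
print(optional)                # ['color', 'colour']
print(one_or_more)             # ['1', '22', '333']
print(zero_or_more)            # ['a', 'ab', 'abbb']
print(bounded)                 # ['wise', 'use', 'the']
```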

In [32]:
import pandas as pd

tweetdata_url = "https://raw.githubusercontent.com/CALDISS-AAU/course_ndms-I/master/datasets/poltweets_sample.csv"
tweets_df = pd.read_csv(tweetdata_url)

tweets_df.head()

Unnamed: 0,created_at,id,full_text,is_quote_status,retweet_count,favorite_count,favorited,retweeted,is_retweet,hashtags,urls,user_followers_count,party
0,2020-10-21 14:48:39+00:00,1318927184111730700,Er på vej i miljøministeriet for at foreslå at...,False,13,47,False,False,False,['dkgreen'],[],4064,Alternativet
1,2019-06-02 20:03:20+00:00,1135275725592891400,@nielscallesoe @helenehagel @alternativet_ Det...,False,0,1,False,False,False,[],[],4064,Alternativet
2,2016-03-10 09:07:52+00:00,707855478320189400,"Vi står sammen, smiler Løkke på KL-topmøde og ...",False,13,14,False,False,False,"['dkpol', 'KLtop16']",[],4064,Alternativet
3,2019-04-07 19:59:03+00:00,1114980930467315700,@AnnaBylov @EU_Spring @rasmusnordqvist 💚,False,0,2,False,False,False,[],[],4064,Alternativet
4,2017-05-28 09:59:26+00:00,868768670427828200,Der er ikke noget alternativ til at Alternativ...,False,6,28,False,False,False,['LMÅ17'],"[{'url': 'https://t.co/3MCdZZGKRq', 'expanded_...",4064,Alternativet


In [34]:
# Keep rows whose text contains the substring "klima" (case-sensitive by default)
tweets_sub = tweets_df.loc[tweets_df['full_text'].str.contains("klima"), :]
print(tweets_sub.shape)
tweets_sub.head()

(160, 13)


Unnamed: 0,created_at,id,full_text,is_quote_status,retweet_count,favorite_count,favorited,retweeted,is_retweet,hashtags,urls,user_followers_count,party
1,2019-06-02 20:03:20+00:00,1135275725592891400,@nielscallesoe @helenehagel @alternativet_ Det...,False,0,1,False,False,False,[],[],4064,Alternativet
4,2017-05-28 09:59:26+00:00,868768670427828200,Der er ikke noget alternativ til at Alternativ...,False,6,28,False,False,False,['LMÅ17'],"[{'url': 'https://t.co/3MCdZZGKRq', 'expanded_...",4064,Alternativet
12,2020-11-02 23:24:00+00:00,1323405529683763200,@JanGuldager @alternativet_ Så når vi aldrig v...,False,0,3,False,False,False,[],[],4072,Alternativet
25,2020-10-30 11:09:39+00:00,1322133560308961300,"Hvis du kan læse gadens dagsorden, kan du læse...",False,4,17,False,False,False,"['Dkpol', 'dkklima']",[],4072,Alternativet
49,2016-10-06 13:47:41+00:00,784027350774255600,"Rasmus Nordqvist, åbningstale: 2025 plan skade...",False,4,6,False,False,False,['dkpol'],[],4064,Alternativet


In [35]:
import re

# Match "klima" only as a whole word (word boundary on both sides), ignoring case
regex = re.compile(r"\bklima\b", re.IGNORECASE)

In [36]:
tweets_sub = tweets_df.loc[tweets_df['full_text'].str.contains(regex), :]
print(tweets_sub.shape)
tweets_sub.head()

(45, 13)


Unnamed: 0,created_at,id,full_text,is_quote_status,retweet_count,favorite_count,favorited,retweeted,is_retweet,hashtags,urls,user_followers_count,party
49,2016-10-06 13:47:41+00:00,784027350774255600,"Rasmus Nordqvist, åbningstale: 2025 plan skade...",False,4,6,False,False,False,['dkpol'],[],4064,Alternativet
168,2019-01-28 11:34:07+00:00,1089849093881503700,Så er min valgkampagne for alvor skudt igang f...,False,7,35,False,False,False,['dkgreen'],[],4064,Alternativet
444,2020-02-19 10:43:35+00:00,1230080481905008600,MS på foretræde i beskæftigelsesudvalget i dag...,False,1,2,False,False,False,"['klima', 'dkpol']",[],4064,Alternativet
634,2019-06-26 16:24:25+00:00,1143917944910614500,Egentlig fine hensigtserklæringer i regeringsa...,False,6,37,False,False,False,['dkpol'],[],12276,Dansk Folkeparti
971,2020-09-25 12:16:13+00:00,1309466735309926400,Det kølige overblik 😁 #klima https://t.co/YCNT...,False,2,18,False,False,False,['klima'],"[{'url': 'https://t.co/YCNT6iCAT9', 'expanded_...",25348,Dansk Folkeparti
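
The two filters above differ in two ways: the plain string `"klima"` is case-sensitive and matches inside longer words, while `\bklima\b` with `re.IGNORECASE` ignores case but only matches the whole word. The sketch below reproduces that difference in plain Python on a few invented strings (they are assumptions for illustration, not rows from `tweets_df`); `str.contains` in pandas applies the same kind of search to each row.

```python
import re

# Hypothetical example texts, not taken from the dataset
tweets = [
    "Klima first!",            # capital K, whole word
    "The klimalov passed",     # "klima" inside a longer word
    "#klima now",              # hashtag: "#" is not a word character
    "Nothing relevant here",
]

# Plain substring search: case-sensitive, matches inside compounds
substring_hits = [t for t in tweets if "klima" in t]

# Whole-word search, ignoring case
word_regex = re.compile(r"\bklima\b", re.IGNORECASE)
word_hits = [t for t in tweets if word_regex.search(t)]

print(substring_hits)  # ['The klimalov passed', '#klima now']
print(word_hits)       # ['Klima first!', '#klima now']
```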


In [39]:
print(tweets_df.shape)

(5500, 13)


In [40]:
lw_regex = re.compile(r"\w{20,30}", re.IGNORECASE)

In [41]:
tweets_sub = tweets_df.loc[tweets_df['full_text'].str.contains(lw_regex), :]
print(tweets_sub.shape)
tweets_sub.head()

(281, 13)


Unnamed: 0,created_at,id,full_text,is_quote_status,retweet_count,favorite_count,favorited,retweeted,is_retweet,hashtags,urls,user_followers_count,party
39,2016-09-10 11:40:39+00:00,774573295169568800,På græs med hønsene til Bæredygtighedsfestival...,False,2,7,False,False,False,"['venligrevolution', 'gokgok', 'dkgreen']",[],4064,Alternativet
51,2017-02-23 19:27:19+00:00,834847123715870700,Lad os erstatte tvang med tillid i beskæftigel...,False,7,18,False,False,False,"['venligrevolution', 'dkpol', 'dksocial']","[{'url': 'https://t.co/bhvnCbwYha', 'expanded_...",4064,Alternativet
80,2016-07-23 21:41:49+00:00,756967579164479500,"Tak for tydelighed, SF...men hvad med S? Vil I...",False,9,10,False,False,False,"['dkpol', 'dksocial']","[{'url': 'https://t.co/WYbOZv9iZh', 'expanded_...",4064,Alternativet
126,2019-10-01 12:22:26+00:00,1179008671130497000,Fint af Mette Frederiksen at basere åbningstal...,False,12,103,False,False,False,[],[],4064,Alternativet
153,2020-11-09 13:41:19+00:00,1325795608167321600,"Måske kan S, R eller Ø anvise en aktuel redukt...",False,12,24,False,False,False,['dkgreen'],"[{'url': 'https://t.co/J5SzVU6Ytz', 'expanded_...",4092,Alternativet


In [42]:
# Same pattern wrapped in a capture group, which str.extractall requires
lw_regex = re.compile(r"(\w{20,30})", re.IGNORECASE)

In [44]:
list(tweets_df['full_text'].str.extractall(lw_regex)[0])

['Bæredygtighedsfestival',
 'beskæftigelsessystemet',
 'førtidspensionsforliget',
 'beskæftigelsessystem',
 'klimaforhandlingerne',
 'kontanthjælpsmodtagere',
 'fællesskabsmilliarden',
 'beskæftigelsesminister',
 'beskæftigelsesministeren',
 'fattigdomsreformerne',
 'forvaltningsdomstole',
 'mindretalsbeskyttelse',
 'beskæftigelsesministeriet',
 'Demokratikommissionen',
 'Socialrådgiverforening',
 'ligestillingsudfordring',
 'handicapkonventionen',
 'beskæftigelsessystem',
 'omfordelingsprojekter',
 'vegetarburgerunderøgelse',
 'kontanthjælpsmodtagere',
 'kontanthjælpsmodtager',
 'Demokratikommissionen',
 'Verdensmålskonference',
 'arbejdsmarkedspolitik',
 'beskæftigelsesudvalget',
 'temperaturstigningerne',
 'finansieringsmodeller',
 'beskæftigelsespolitik',
 'beskæftigelsesområdet',
 'kommissionsundersøgelse',
 'Uddannelsesministeren',
 'udlændingeministeren',
 'Granskningskommissionen',
 'klippekortsordningen',
 'seniorførtidspension',
 'Ytringsfrihedskommissio
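
One detail of `\w{20,30}` is worth keeping in mind: without word boundaries, a word longer than 30 characters still matches, but only its first 30 characters are returned. The sketch below shows this on a made-up 35-character "word" (an assumption for illustration, not something observed in the tweet data), along with two possible variants of the pattern.

```python
import re

# Hypothetical input: a single 35-character "word" between two short ones
long_word = "x" * 35
text = "short " + long_word + " end"

# Unanchored: matches the first 30 characters of the long word
unanchored = re.findall(r"\w{20,30}", text)

# With boundaries: only whole words of 20 to 30 characters, so no match here
anchored = re.findall(r"\b\w{20,30}\b", text)

# Open-ended upper bound: whole words of 20 or more characters
open_ended = re.findall(r"\b\w{20,}\b", text)

print([len(m) for m in unanchored])  # [30]
print(anchored)                      # []
print([len(m) for m in open_ended])  # [35]
```

Which variant fits best depends on whether truncated matches are acceptable for the analysis at hand.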

In [50]:
# Capture whole words that start with "klima" followed by at least one more word character
klima_regex = re.compile(r"(\bklima\w+)", re.IGNORECASE)

In [51]:
list(tweets_df['full_text'].str.extractall(klima_regex)[0])

['klimaet',
 'klimaafgifter',
 'Klimaet',
 'klimaet',
 'klimaDanmark',
 'klimaet',
 'klimaet',
 'klimamyter',
 'klimamyteknuser',
 'klimaforhandlingerne',
 'klimalovgivning',
 'klimamin',
 'klimalove',
 'klimaaktion',
 'klimamarchen',
 'klimamin',
 'klimakrisen',
 'klimaflygtninge',
 'klimabistand',
 'klimamarch',
 'klimasikring',
 'klimainvesteringer',
 'klimaet',
 'klimamarchen',
 'klimaflertal',
 'klimaproblemerne',
 'klimamarcherne',
 'klimamålet',
 'klimakrisen',
 'Klimapolitik',
 'klimafesten',
 'klimaet',
 'klimastrejke',
 'klimaløsninger',
 'klimavenligt',
 'klimaneutralt',
 'klimaet',
 'klimavenlig',
 'klimauge',
 'klimaet',
 'Klimaet',
 'klimavalg',
 'Klimaforandringerne',
 'klimatruslen',
 'klimatiltag',
 'klimaet',
 'klimaet',
 'klimaets',
 'klimaregning',
 'klimakrisen',
 'klimaudspil',
 'Klimaloven',
 'klimahandlingsplaner',
 'klimalov',
 'klimalove',
 'klimarapport',
 'klimamål',
 'klimakrisen',
 'Klimaet',
 'Klimaproblemer',
 'klimarådets',
 'klimasvigt',
 'klimamin
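
A list of matches like the one above is often easier to read once it is counted. The following is a minimal sketch of one way to do that with `collections.Counter`, using `re.findall` on a few invented texts in place of `tweets_df['full_text']`; lowercasing the matches first is a design choice here, so that `Klimaet` and `klimaet` are counted together.

```python
import re
from collections import Counter

# Hypothetical texts standing in for the tweet column
texts = [
    "Klimaet har brug for en klimalov",
    "klimalov nu",
    "Ingen match her",
]

# Same idea as klima_regex above, without the capture group:
# re.findall then returns the whole match
klima_words = re.compile(r"\bklima\w+", re.IGNORECASE)

# Collect all matches across texts, lowercased so case variants are merged
matches = [m.lower() for t in texts for m in klima_words.findall(t)]
counts = Counter(matches)

print(counts.most_common(2))  # [('klimalov', 2), ('klimaet', 1)]
```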