# Regular Expressions

**Objective** 
Use regular expressions to analyze a dataset with tweets and extract the following elements:
- Hashtags
- Users
- URLs
- Time (in any format)
- Emoticons in ASCII
- Emojis

Then print the 10 most common instances of each element.

## Import Libraries

In [418]:
import re

## Function to Read the CSV and Store Each Tweet in a List

In [419]:
def read_csv(path):
    # Read the file line by line; every non-empty line is treated as one tweet
    tweets = []

    with open(path, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()  # Remove leading and trailing whitespace

            if line:  # Skip blank lines
                tweets.append(line)

    return tweets

In [420]:
# Read the file and store the data
tweets = read_csv('tweets.csv')

text
termine bien abrumado después de hoy
me siento abrumado
"Me siento un poco abrumado por la cantidad de cosas que quiero dibujar, ver, jugar y leer. Odio esta sensación xdddd"
Salvador la única persona que no la ha abrumado de versiones❤😒❤ #NadieComoTú
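
The preview above suggests that the file has a header row (`text`) and quoted fields that contain commas. The following is a minimal sketch, assuming that layout, of parsing the lines with the standard `csv` module so that the header is skipped and the surrounding quotes are removed. `parse_tweets` and its `column` parameter are hypothetical names, not part of the notebook.

```python
import csv

def parse_tweets(lines, column='text'):
    """Return the non-empty values of `column` from CSV lines that start with a header row."""
    tweets = []
    for row in csv.DictReader(lines):
        value = (row.get(column) or '').strip()
        if value:
            tweets.append(value)
    return tweets

# Three lines taken from the preview above
sample = [
    'text\n',
    'me siento abrumado\n',
    '"Me siento un poco abrumado por la cantidad de cosas que quiero dibujar, ver, jugar y leer. Odio esta sensación xdddd"\n',
]
parsed = parse_tweets(sample)
print(parsed)
```

Since `csv.DictReader` accepts any iterable of strings, an open file works as well, for example `open('tweets.csv', newline='', encoding='utf-8')`.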


## Find Hashtags

The hashtag must meet the following requirements:

- Begin with '#'
- Must contain at least one letter, digit, or underscore after the '#'
- Has no spaces
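
As a quick check of these requirements, the sketch below applies the pattern `#\w+` to one tweet from the preview and to a made-up string with no valid hashtag. In Python 3, `\w` in a `str` pattern matches Unicode word characters, so accented letters such as 'ú' stay inside the match.

```python
import re

HASHTAG = re.compile(r'#\w+')

# Tweet taken from the data preview
sample = 'Salvador la única persona que no la ha abrumado de versiones❤😒❤ #NadieComoTú'
found = HASHTAG.findall(sample)
print(found)

# Made-up string: a lone '#' or a '#' followed by punctuation is not a hashtag
rejected = HASHTAG.findall('precio # 5 y #!')
print(rejected)
```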

In [421]:
regex = r'#\w+'  
count = 0
# Save the results in a dictionary to count the occurrences
hashtags = dict()

for tweet in tweets:
    result = re.findall(regex, tweet)
    # If a result was found it is stored in the dictionary
    if result:
        # iterate over the resulting list and store each element
        for r in result:
            count += 1
            # if the hashtag is in the dictionary increase the frequency by one
            if r in hashtags:
                hashtags[r] += 1
            # store the element with frequency 1
            else:
                hashtags[r] = 1
                
# Sort the dictionary in descending order of hashtag frequency
hashtags = dict(sorted(hashtags.items(), key=lambda item: item[1], reverse=True))


print(f'Frequency {count}')
# Iterate over the first 10 elements of the dictionary
for i, (key, value) in enumerate(hashtags.items()):
    if i < 10:
        print(f"{key} --- {value}")
    else:
        break

Frequency 298
#UnidosTodosX --- 26
#DeZurdaTeam --- 26
#GranHermano --- 21
#granhermano --- 9
#gelp --- 7
#OTDirecto5E --- 4
#gh23 --- 4
#NadieComoTú --- 3
#MicroCuento --- 3
#Bailando2023 --- 3
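
The output above lists #GranHermano and #granhermano separately because the matches are counted exactly as written. The following is a hedged sketch of the same count-and-rank step with `collections.Counter`, plus an optional normalization step to merge spellings that differ only in case. `top_matches` and its parameters are hypothetical names, and the sample list is made up for illustration.

```python
import re
from collections import Counter

def top_matches(pattern, texts, n=10, normalize=None):
    """Count every regex match across `texts` and return the `n` most common.

    The pattern should contain no capturing groups, so that re.findall
    returns whole matches rather than tuples.
    """
    counts = Counter()
    for text in texts:
        for match in re.findall(pattern, text):
            counts[normalize(match) if normalize else match] += 1
    return counts.most_common(n)

# Made-up sample, only to illustrate the call
sample = ['#GranHermano hoy', 'otra vez #granhermano y #gh23', 'sigo con #GranHermano', '#GranHermano']

exact = top_matches(r'#\w+', sample, n=1)                    # case-sensitive count
merged = top_matches(r'#\w+', sample, normalize=str.lower)   # case-insensitive count
print(exact)
print(merged)
```

Whether to merge case variants depends on the analysis; the notebook keeps them separate.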


## Find Users

A user mention must meet almost the same requirements as a hashtag:

- Begin with '@'
- Must contain at least one letter, digit, or underscore after the '@'
- Has no spaces
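
One side effect of these rules is that `@\w+` also matches the part after the '@' in an email address. The sketch below compares it with a variant that uses a negative lookbehind to require that no word character precedes the '@'. The variant is an assumption about what counts as a mention, not the notebook's pattern, and the sample text is made up.

```python
import re

# Made-up text with an email address and a mention
text = 'escribe a contacto@example.com o menciona a @JMilei'

plain = re.findall(r'@\w+', text)            # pattern used in the notebook
strict = re.findall(r'(?<!\w)@\w+', text)    # skips '@' that follows a word character
print(plain)
print(strict)
```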

In [422]:
regex = r'@\w+'  
count = 0
users = dict()

for tweet in tweets:
    result = re.findall(regex, tweet)
    # If a result was found it is stored in the dictionary
    if result:
        # iterate over the resulting list and store each element
        for r in result:
            count += 1
            if r in users:
                users[r] += 1
            # store the element with frequency 1
            else:
                users[r] = 1
                 
users = dict(sorted(users.items(), key=lambda item: item[1], reverse=True))

print(f'Frequency {count}')
for i, (key, value) in enumerate(users.items()):
    if i < 10:
        print(f"{key} ---- {value}")
    else:
        break

Frequency 194
@petrogustavo ---- 7
@DeZurdaTeam_ ---- 6
@JMilei ---- 4
@biobio ---- 3
@radiocarab ---- 3
@TTISantiago ---- 3
@mop_chile ---- 3
@mop_rm ---- 3
@MabelLaraNews ---- 2
@_somosmadrid ---- 2


## URLs

We will look for URLs that have an explicit protocol, so they must meet the following requirements:

- Begin with 'http'
- Optionally have one 's' after 'http'
- Have '://' right after the protocol
- Have at least one non-space character after the '://'

Ranges inside a regex character class follow code point order, so the class '[!#-~]' covers '!' plus every ASCII character from '#' to '~'. That is every printable ASCII character except the space and the double quote.
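
To see where the class `[!#-~]` stops, the sketch below runs the pattern on a made-up sentence. The match ends at a double quote or a space, but trailing punctuation such as a final period is still captured. The `rstrip` clean-up is an optional assumption, not part of the notebook.

```python
import re

URL = re.compile(r'https?://[!#-~]+')

# Made-up sentence: one URL inside double quotes, one followed by a period
text = 'mira "https://t.me/tierrasant" y http://example.com/a?b=1. Luego hablamos'

found = URL.findall(text)
# Optional clean-up: drop sentence punctuation stuck to the end of a URL
cleaned = [url.rstrip('.,;:!?') for url in found]
print(found)
print(cleaned)
```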

In [423]:
regex = r'https?://[!#-~]+'
count = 0
urls = dict()

for tweet in tweets:
    result = re.findall(regex, tweet)
    if result:
        for r in result:
            count += 1
            if r in urls:
                urls[r] += 1
            else:
                urls[r] = 1

# Sort the dictionary in descending order of frequency
urls = dict(sorted(urls.items(), key=lambda item: item[1], reverse=True))

print(f'Frequency {count}')
# Iterate over the first 10 elements of the dictionary
for i, (key, value) in enumerate(urls.items()):
    if i < 10:
        print(f"{key} ---- {value}")
    else:
        break

Frequency 8
https://www.eldiario.es/1_a4fa72?utm_campaign=botonera-share&utm_medium=social&utm_source=twitter ---- 2
https://elfaro.net/es/202312/columnas/27191/el-voto-de-la-frustracion-gano-el-tour-electoral-de-2023 ---- 1
https://www.youtube.com/watch?v=1QvBbg38UY8&ab_channel=ElMostrador ---- 1
https://nitter.perennialte.ch/MaxKaiser75/status/1743261455326007754/video/1 ---- 1
https://signal.group/#CjQKIAL1PfYMtji-3OMw24eFifKyZSI9bNbHpdvfWONAMrnvEhAgxrDIgXSX8-35VZTa6H_n ---- 1
https://t.me/tierrasant ---- 1
https://twitter.com/MaxKaiser75/status/1743261455326007754/video/1 ---- 1


## Time 

- Must begin with a number 
- Can be in these formats:
    - 00:00
    - 00 am
    - 00 pm
    - 00 horas
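
The notebook's cell below accepts a bare hour and then discards short results in a separate step. As an alternative sketch, the pattern here requires either minutes or a suffix ('horas', 'am', 'pm') inside the regex itself, and `findall` collects every time in a tweet. It also accepts single-digit hours, which is an assumption beyond the formats listed above. The sample sentence is made up.

```python
import re

# Hour 0-24, then either ':MM' (optionally with am/pm) or a 'horas'/am/pm suffix
TIME = re.compile(r'\b(?:[01]?\d|2[0-4])(?::[0-5]\d(?:\s?[ap]m)?|\s?horas|\s?[ap]m)\b')

# Made-up sentence; '23 años' is an age, not a time
text = 'Nos vemos a las 17:45 o a las 10 pm; llevo 24 horas despierto y tengo 23 años'

found = TIME.findall(text)
print(found)
```

Because all groups are non-capturing, `findall` returns the full matched strings.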

In [424]:
# Hour (00 or 10 to 24), optional minutes (:00 to :59), then an optional 'horas' and an optional am/pm,
# each suffix optionally preceded by one whitespace character
regex = r'([1]\d|00|[2][0-4])(:[0-5]\d)?(\s?horas)?(\s?[ap]m)?'
count = 0

# Save the results in a dictionary to count the occurrences
time = dict()

for tweet in tweets:
    result = re.search(regex, tweet)
    # re.search returns only the first match in the tweet; validate it if found
    if result:
        string = result.group()
        # Keep only results longer than 2 characters, which discards bare numbers such as 23
        if len(string) > 2:
            count += 1
            if string in time:
                time[string] += 1
            # store the element with frequency 1
            else:
                time[string] = 1
   
time = dict(sorted(time.items(), key=lambda item: item[1], reverse=True))

print(f'Frequency {count}')
# Iterate over the first 10 elements of the dictionary
for i, (key, value) in enumerate(time.items()):
    if i < 10:
        print(f"{key} ---- {value}")
    else:
        break


Frequency 14
23:58 ---- 4
17:45 ---- 3
11:11 ---- 2
10 pm ---- 2
24 horas ---- 1
12 horas ---- 1
00:01 ---- 1


## Emoticons in ASCII

- The eyes (':') can come before or after the mouth
- The faces can be:
    - :p
    - :\)
    - :\(
    - :D
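
The notebook's cell below makes both colons optional and then checks in Python that one is present. As an alternative sketch, the colon can be required inside the pattern with an alternation: eyes first, or mouth first. Restricting the mouth-first form to parentheses is an assumption, and the sample text is made up.

```python
import re

# ':' followed by one or more mouths, or parentheses followed by ':'
EMOTICON = re.compile(r':[)(pD]+|[)(]+:')

# Made-up text; '(nota)' has parentheses but no eyes
text = 'hola :) que mal ): jaja :))) :pp (nota)'

found = EMOTICON.findall(text)
print(found)
```

Like the original pattern, this one can still match inside ordinary text, for example a colon directly followed by a word that starts with 'p'.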

In [425]:
regex = r':?[\)\(pD]+:?'
emoticons = dict()
count = 0

for tweet in tweets:
    result = re.findall(regex, tweet)
    # If a result was found it is stored in the dictionary
    if result:
        # iterate over the resulting list and store each element
        for r in result:
            if r[0] == ':' or r[-1] == ':':
                count += 1
                if r in emoticons:
                    emoticons[r] += 1
                # store the element with frequency 1
                else:
                    emoticons[r] = 1
                
# Sort the dictionary in descending order of frequency
emoticons = dict(sorted(emoticons.items(), key=lambda item: item[1], reverse=True))

print(f'Frequency {count}')
# Iterate over the first 10 elements of the dictionary
for i, (key, value) in enumerate(emoticons.items()):
    if i < 10:
        print(f"{key} ---- {value}")
    else:
        break

Frequency 14
:) ---- 4
): ---- 4
:))) ---- 2
:( ---- 2
:)) ---- 1
:pp ---- 1


## Emojis 

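Find emojis such as 😀. The pattern matches one non-space character followed by one character in the range U+263A to U+1F645, so every result below is a pair of characters.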
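
The sketch below shows what the leading `[^ ]` in the notebook's pattern does on one tweet from the preview, compared with a variant that matches a single code point in the same range. Neither is a complete emoji matcher. The range also contains non-emoji characters such as CJK ideographs, and some emojis, such as 🙏 (U+1F64F), lie above its upper bound.

```python
import re

# Tweet from the preview, with the emojis written as escapes: ❤ is U+2764, 😒 is U+1F612
sample = 'Salvador la única persona que no la ha abrumado de versiones\u2764\U0001F612\u2764 #NadieComoTú'

# Notebook pattern: any non-space character plus one character in the range
pairs = re.findall(r'[^ ][\u263a-\U0001f645]', sample)
# Variant: only the character in the range
singles = re.findall(r'[\u263a-\U0001f645]', sample)
print(pairs)
print(singles)
```

The first list pairs the letter 's' with the first heart, which is why the counts produced by the notebook's pattern mix emojis with their neighbouring characters.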

In [433]:
regex = r'[^ ][\u263a-\U0001f645]'
count = 0
emojis = dict()

for tweet in tweets:
    result = re.findall(regex, tweet)
    # If a result was found it is stored in the dictionary
    if result:
        # iterate over the resulting list and store each element
        for r in result:
            count += 1
            # if the match is already in the dictionary, increase its frequency by one
            if r in emojis:
                emojis[r] += 1
            # store the element with frequency 1
            else:
                emojis[r] = 1
                
# Sort the dictionary in descending order of frequency
emojis = dict(sorted(emojis.items(), key=lambda item: item[1], reverse=True))

print(f'Frequency {count}')
# Iterate over the first 10 elements of the dictionary
for i, (key, value) in enumerate(emojis.items()):
    if i < 10:
        print(f"{key} --- {value}")
    else:
        break

Frequency 502
❤️ --- 28
😭😭 --- 18
🙏🏻 --- 12
‍♀ --- 9
‍♂ --- 7
🫶🏻 --- 7
‍💫 --- 6
🤷😂 --- 6
"⠀ --- 6
⚽⚽ --- 6
