# Extract hashtags using Regex

In [45]:
import pandas as pd
import re
import requests

We use the requests.get function to fetch the data from the web address (URL). Python sends a GET request to the server. If we receive the status code 200, the request succeeded (OK).

In [46]:
url = 'https://raw.githubusercontent.com/su-mt4007/data/refs/heads/main/comments.txt'
response = requests.get(url)
response

<Response [200]>

In [47]:
response.text[:1000]

'1. "Great post! #programming #tips"\n2. "Enjoyed the article. #coding #python"\n3. "Interesting insights. #tech #data"\n4. "This is awesome! #programming #coding"\n5. "Thanks for sharing. #data #analysis"\n6. "I learned a lot. #programming #python #tips"\n7. "Cool stuff! #tech #innovation"\n8. "Amazing read. #coding #python"\n9. "Impressive content. #data #analytics"\n10. "Inspiring! #programming #tips"\n11. "Helpful tutorial. #coding #python"\n12. "I agree with the points raised. #tech #data"\n13. "This is so useful! #programming #coding"\n14. "Interesting findings. #data #insights"\n15. "Well explained. #programming #python #tips"\n16. "Exciting discoveries. #tech #research"\n17. "Brilliant insights. #coding #python"\n18. "Insightful analysis. #data #analytics"\n19. "This changed my perspective. #programming #tips"\n20. "Innovative ideas. #coding #innovation"\n21. "Love the content! #programming #python #tips"\n22. "Impressed with the insights. #tech #data"\n23. "Useful tips! #codin

We want every comment on its own row. In the output above, each \n marks a line break (the 2 that follows the first one is just the number of the next comment). For this we use splitlines(), which returns a list where every element is one line of the text.

In [48]:
comments = response.text.splitlines() 
comments[:10]

['1. "Great post! #programming #tips"',
 '2. "Enjoyed the article. #coding #python"',
 '3. "Interesting insights. #tech #data"',
 '4. "This is awesome! #programming #coding"',
 '5. "Thanks for sharing. #data #analysis"',
 '6. "I learned a lot. #programming #python #tips"',
 '7. "Cool stuff! #tech #innovation"',
 '8. "Amazing read. #coding #python"',
 '9. "Impressive content. #data #analytics"',
 '10. "Inspiring! #programming #tips"']
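
As a small offline illustration of what splitlines() does, the sketch below uses a hard-coded two-line string in place of response.text. The names sample and rows are made up for this example and are not part of the notebook.

```python
# Hypothetical stand-in for response.text: two comments, each ended by "\n".
sample = '1. "Great post! #programming #tips"\n2. "Enjoyed the article. #coding #python"\n'

# splitlines() cuts the string at every line break and drops the break itself.
rows = sample.splitlines()
print(rows)

# split("\n") cuts at the same places but keeps an empty string
# after the trailing "\n", which splitlines() avoids.
print(sample.split("\n"))
```

This is why splitlines() is a convenient choice here: a trailing newline at the end of the file does not produce an empty last row.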

### We will now extract the hashtags from the data. 

In [51]:
# "#\w+" matches a "#" followed by one or more letters, digits, or underscores
hashtags_per_comment = [re.findall(r"#\w+", line) for line in comments]
hashtags_per_comment[:10] 

[['#programming', '#tips'],
 ['#coding', '#python'],
 ['#tech', '#data'],
 ['#programming', '#coding'],
 ['#data', '#analysis'],
 ['#programming', '#python', '#tips'],
 ['#tech', '#innovation'],
 ['#coding', '#python'],
 ['#data', '#analytics'],
 ['#programming', '#tips']]
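
To make the behaviour of the pattern #\w+ more concrete, here is a minimal sketch on an invented comment (the string in sample does not come from comments.txt). It shows where a match stops when a hashtag is followed by punctuation or contains a hyphen.

```python
import re

# Hypothetical comment, chosen only to exercise the pattern.
sample = 'Nice! #python3 #data_science #machine-learning, thanks #tips.'

# "#" matches a literal hash sign; "\w+" matches one or more letters,
# digits, or underscores, so a match ends at "-", "," or ".".
tags = re.findall(r"#\w+", sample)
print(tags)
# ['#python3', '#data_science', '#machine', '#tips']
```

Note that the hyphenated tag is cut off at the hyphen. The comments in this data set only use plain words as hashtags, so this does not affect the results above.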

#### Find the comments that contain both #programming and #python

##### Strategy 1

In [5]:
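# Each (?=...) is a lookahead: it checks that the tag occurs somewhere in the line
# without consuming any text, so the order of the two tags does not matter.
pattern = r'^(?=.*#programming)(?=.*#python).*$'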
matches = [c for c in comments if re.search(pattern, c)]
matches

['6. "I learned a lot. #programming #python #tips"',
 '15. "Well explained. #programming #python #tips"',
 '21. "Love the content! #programming #python #tips"',
 '30. "Inspired by the tips. #programming #python #tips"']
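
One possible refinement of Strategy 1, sketched here on invented comments: the lookaheads above would also accept a longer tag that merely starts with #python, such as #pythonista. Adding the word boundary \b after each tag avoids that. The names samples, loose, and strict are assumptions made for this illustration only.

```python
import re

# Hypothetical comments, not taken from comments.txt.
samples = [
    '"Nice. #programming #python"',
    '"Nice. #programming #pythonista"',
    '"Nice. #python #programming"',
]

# Same pattern as in Strategy 1.
loose = r'^(?=.*#programming)(?=.*#python).*$'
# \b requires the tag to end at a word boundary,
# so "#pythonista" no longer counts as "#python".
strict = r'^(?=.*#programming\b)(?=.*#python\b).*$'

loose_hits = [s for s in samples if re.search(loose, s)]
strict_hits = [s for s in samples if re.search(strict, s)]

print(len(loose_hits))   # all three comments match
print(strict_hits)       # only the first and the third comment
```

The third sample also shows that the lookaheads do not depend on the order in which the two tags appear.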

##### Strategy 2

In [6]:
def contains_programming_and_python(comment):
    return bool(re.search(r"#programming", comment) and re.search(r"#python", comment))

comments_with_both = [comment for comment in comments if contains_programming_and_python(comment)]
comments_with_both

['6. "I learned a lot. #programming #python #tips"',
 '15. "Well explained. #programming #python #tips"',
 '21. "Love the content! #programming #python #tips"',
 '30. "Inspired by the tips. #programming #python #tips"']
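
A variant of Strategy 2 that reuses the hashtag extraction from earlier, given as a sketch rather than as part of the original solution: collect all hashtags of a comment into a set and check that the wanted tags are a subset of it. The function name contains_all_tags and the variable wanted are hypothetical.

```python
import re

def contains_all_tags(comment, required):
    """Return True if every tag in `required` occurs as a whole hashtag in `comment`."""
    found = set(re.findall(r"#\w+", comment))
    # "<=" on sets tests whether the left set is a subset of the right one.
    return set(required) <= found

# Two comments from the data plus one invented comment with a longer tag.
samples = [
    '6. "I learned a lot. #programming #python #tips"',
    '2. "Enjoyed the article. #coding #python"',
    '"Made up. #programming #pythonista"',
]

wanted = {"#programming", "#python"}
hits = [s for s in samples if contains_all_tags(s, wanted)]
print(hits)
```

Because whole hashtags are compared, #pythonista is not mistaken for #python, and the same function works for any number of required tags.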