# Project Overview

Matthew Morgan, B.S

### Introduction

Scraping Twitter for data relating to the field and profession of Genetic Counseling. 
<br>
<br>A common problem in entering the genetic counseling field is lack of exposure and difficulty in finding that exposure to the field of genetic counseling. With many resources being scattered and posted on different channels, it is difficult to locate and combine everything. The goal of this project is to scrape links and information from the internet related to genetic counseling and pool those together as a conglomerate of resources for prospective genetic counseling students to use. This will hopefully provide a more convenient resource for finding exposure and opportunites in genetic counseling.

### Related Works

### Methods

**Focusing on Twitter:**
<br>1: Look for common twitter tags that relate to genetic counseling
<br>&emsp; a. Got event advertisements from 08/2020 - 07/2022 from a discord channel for prospective genetic counseling students.
<br>&emsp; b. Filtered through these events to create a wordbank of common words to find genetic counseling events for prospective students.
<br>&emsp; c. Used https://databasic.io/en/wordcounter/results/62cb4b55d16c580354d7dd92 to find most common words. Then analyzed to select the words that made &emsp;&emsp;up at least .05% of the word pool. These words were then cleaned further and used to search for common hashtags
<br>&emsp; d. Scraped steps a-c because tweepy rate-limits the number of tweets that can be pulled at 900 tweets / 15minutes. Decided to just use the phrase &emsp;&emsp;"prospective student genetic counseling" and "genetic counseling" to search for tweet resources.
<br>2: Use those tags to find posts that contain links or resources advertisements, can also look for events and webinars.
<br>3: Analyze tweets for authenticity and usefullness using follower count, length of account, quality of interaction (positive or negative), and more.
<br>4: create a sorted list of resources with descriptions and helpful information
<br>5: (TBD) advertise the list (possibly as a website or a post somewhere? TBD)
<br><br>Started by using Selenium Chromedriver to gather tweet data, but it soon became apparent that this would be alot of work and would require constant updates as both Chrome and Twitter constantly update HTML format and versions. Next the Tweepy Twitter API was attempted to be used and offered a much simpler way of gathering Twitter data.
<br><br>Then realized I can use tweepy to just get the dataset of relevant tweets from twitter using hashtags as a form of a "broad filter" to get genetic counseling tweets. Can first search genetic counseling keywords to find most common hashtags to help find desired tweets. Can then use those hashtags to extract relevant tweets into a dataset that can be analyzed.

### Code

In [38]:
import tweepy
import config
from collections import Counter
from nltk.corpus import stopwords #download ntlk stopwords as well
import calendar
import re

In [2]:
#filter through GC events to find most common keywords

filename = "methods/commonWords.txt"

file = open(filename, "r")

words = []
 
# Traversing file line by line
for line in file:
    line = line.strip()
    if (any(char.isdigit() for char in line) == False):
        words.append(line)
file.close();

find hashtags

In [3]:
#filter through GC events to find most common keywords
api_key = config.api_key
api_secrets = config.api_key_secret
access_token = config.access_token
access_secret = config.access_token_secret

# Authenticate to Twitter
auth = tweepy.OAuthHandler(api_key,api_secrets)
auth.set_access_token(access_token,access_secret)
api = tweepy.API(auth)

#find and store genetic counseling tweets for prospective students
def findCommonHashTags(keyWord):
    hashtags = []
    for tweet in tweepy.Cursor(api.search, q=keyWord, count=10).items():
        hashtagList = tweet.entities.get('hashtags')
        for i in range(len(hashtagList)):
            hashtags.append(hashtagList[i]['text'])
        pass
    return hashtags

hashtags = []
searchTermList = ['genetic counseling','prospective student genetic counseling']
for i in range(len(searchTermList)):
    hashtags += findCommonHashTags(searchTermList[i])

counter = Counter(hashtags)
freqHashtags = counter.most_common(3)

print(freqHashtags)

[(u'ESCGenomics', 9), (u'CreutzfeldtJakob', 6), (u'cancer', 5)]


Common results are: "GeneChat", "GeneticCounselors", "GeneticCounseling"

In [4]:
def getResources(keyWord):
    links = []
    events = []
    for tweet in tweepy.Cursor(api.search, q=keyWord, count=100).items():
        events.append(tweet.text) #get text to thrift through?
        linkList = tweet.entities.get('urls')
        for i in range(len(linkList)):#cant get unwound url without enterprise api
            links.append(linkList[i]['url'])#can get html title and description
    resources = {'links':links,'events':events}
    return resources

resources = []
searchTermList = ['#GeneChat', "#GeneticCounselors", "#GeneticCounseling"]
for i in range(len(searchTermList)):
    resources.append(getResources(searchTermList[i]))

In [5]:
events = []
links = []
for i in range(len(resources)):
    events += resources[i]['events']
    links += resources[i]['links']

need to workout the links and find a new way to connect resources, for now working on events

In [6]:
for i in range(len(events)):
    events[i] = events[i].lower()
events

[u'100% this \U0001f447\U0001f3fc\U0001f447\U0001f3fc\U0001f447\U0001f3fc  take care of yourself, gc friends! #genechat https://t.co/fgt5bzzqxe',
 u'update: the only way to know where my student loans are right now is to download a random txt file full of code fro\u2026 https://t.co/jx3jy08weh',
 u'rt @scientificturtl: good luck to all the gcs taking boards this month!\n\nremember, you are more than your score. not passing this time does\u2026',
 u'#genechat https://t.co/bbdvmf7qkj',
 u'rt @curemsd: student ambassador applications are now live! apply today: https://t.co/9afpqsiees\n*\n #togetherwecan #curemsd #msd #multiplesu\u2026',
 u"rt @bizbiz93: are you looking to become a genetic counselor but don't have local opportunities to work with one? feel free to check out our\u2026",
 u'leaning in to grass. \U0001f605 new gcs are the new grass that\u2019s growing our profession. they\u2019ve sprouted after school culti\u2026 https://t.co/tkdajwa9ic',
 u'on august 10 we have two webinars 

In [28]:
webinars = []
s=set(stopwords.words('english'))
webinarWords = []
searchWords=["webinar", "present"]
for i in range(len(events)):
    if any(e in events[i].lower() for e in searchWords):
        webinars.append(events[i])
#find common words to identify webinar events.
for i in range(len(webinars)):
    webinarWords += filter(lambda w: not w in s,webinars[i].split())


counter2 = Counter(webinarWords)
freqWordsInWebinars = counter2.most_common(30)
len(webinars) #freqWordsInWebinars
webinars

[u'on august 10 we have two webinars taking place! the first is at 10 a.m. ct is an overview and update on tricky neur\u2026 https://t.co/u78nur1ruv',
 u'26 y m w/ hearing loss (hl) presents for initial eval. fam hx: 1 younger sister w/ hl and a father w/ hl &amp; chronic\u2026 https://t.co/qnzxac1grr',
 u'rt @studyrare: 37 y m presents to the ed with chest pain. cxr shows pneumothorax with multiple subpleural cysts. height and weight are @ th\u2026',
 u'rt @dinaalaeddin: the countdown has started! join us, a group of arab genetic counselors as we present with dr. jon weil about: steps towar\u2026',
 u'rt @dinaalaeddin: the countdown has started! join us, a group of arab genetic counselors as we present with dr. jon weil about: steps towar\u2026',
 u'37 y m presents to the ed with chest pain. cxr shows pneumothorax with multiple subpleural cysts. height and weight\u2026 https://t.co/90x2pxqkfl',
 u'rt @dinaalaeddin: the countdown has started! join us, a group of arab genetic counselors

In [33]:
#remove dupes
webinars = list(set(webinars))
                
webinars

[u'rt @hd_genetics: great to present today at @help4hdi virtual hipe day talking a bit about the journey to create hd genetics!  thank you for\u2026',
 u'rt @geneticcouns: join us monday, july 25 from 12-1 pm ct for our webinar on organizational change methods to decrease clinician burnout le\u2026',
 u'14 y f w/ a hx of hereditary fructose intolerance (hfi) presents to the clinic for her yearly evaluation. she adher\u2026 https://t.co/mdem5i20g9',
 u'the countdown has started! join us, a group of arab genetic counselors as we present with dr. jon weil about: steps\u2026 https://t.co/lycargtjkv',
 u"i am so glad that @geneticcouns has an expert speaker (jesse gavin) on clinician burnout for today's webinar. if yo\u2026 https://t.co/2crv0mxbse",
 u"a 28 y f at 10 wks gestation presents for reproductive counseling. she is an fmr1 premut'n carrier (65 cgg repeats)\u2026 https://t.co/a1rgc4bp1s",
 u"a 25 year old woman presents w/ a fam hx of fragile x syndrome. trinucleotide repeat testin

In [43]:
#sort events into dates
#for i in range(len(webinars)):
months = list(calendar.month_name)   
for i in range(len(months)):
    months[i] = months[i].lower()
months.pop(0)
times = []
nums = [1,2,3,4,5,6,7,8,9,10,11,12]
days = range(1,32)
halfTimes = ["00", "15", "30", "45"]
dates = []
dates += ["monday","tuesday","wedsnesday","thursday","friday","saturday","sunday","tomorrow", "days", "month"]
for i in range(len(months)):
    for j in range(len(months)):
        dates.append(months[i] + str(days[j]))
        dates.append(months[i] + " " + str(days[j]))
for i in range(len(nums)):
    for j in range(len(nums)):
        times.append(str(nums[i])+ "-" + str(nums[j]))
        times.append(str(nums[i])+ "- " + str(nums[j]))
        times.append(str(nums[i])+ " -" + str(nums[j]))
        times.append(str(nums[i])+ " - " + str(nums[j]))
datedWebinars = []
for i in range(len(webinars)):
    if any(e in webinars[i] for e in dates):
        datedWebinars.append(webinars[i])
datedWebinars

[u'rt @geneticcouns: join us monday, july 25 from 12-1 pm ct for our webinar on organizational change methods to decrease clinician burnout le\u2026',
 u'on august 10 we have two webinars taking place! the first is at 10 a.m. ct is an overview and update on tricky neur\u2026 https://t.co/u78nur1ruv']

In [45]:
calendarEvents = {}

#can find times and dates based on format needed for presentation
for i in range(len(datedWebinars)):
    for j in range(len(dates)):
        calendarEvents[datedWebinars[i]] = dates[j]
calendarEvents

{u'on august 10 we have two webinars taking place! the first is at 10 a.m. ct is an overview and update on tricky neur\u2026 https://t.co/u78nur1ruv': 'december 12',
 u'rt @geneticcouns: join us monday, july 25 from 12-1 pm ct for our webinar on organizational change methods to decrease clinician burnout le\u2026': 'december 12'}

### Results

### Discussion