# Project Overview

Matthew Morgan, B.S.

### Introduction

This project scrapes Twitter for data relating to the field and profession of genetic counseling.
<br>
<br>A common problem for people entering the genetic counseling field is a lack of exposure to the field and difficulty finding that exposure. Because resources are scattered across many different channels, it is hard to locate and combine them. The goal of this project is to scrape links and information related to genetic counseling from the internet and pool them into a single collection of resources for prospective genetic counseling students. This will hopefully provide a more convenient way to find exposure and opportunities in genetic counseling.

### Related Works

### Methods

**Focusing on Twitter:**
<br>1: Look for common Twitter hashtags that relate to genetic counseling.
<br>&emsp; a. Collected event advertisements from 08/2020 to 07/2022 from a Discord channel for prospective genetic counseling students.
<br>&emsp; b. Filtered these events to create a word bank of common words for finding genetic counseling events aimed at prospective students.
<br>&emsp; c. Used https://databasic.io/en/wordcounter/results/62cb4b55d16c580354d7dd92 to find the most common words, then selected the words that made up at least 0.05% of the word pool. These words were cleaned further and used to search for common hashtags.
<br>&emsp; d. Scrapped steps a-c because Tweepy rate-limits the number of tweets that can be pulled to 900 tweets per 15 minutes. Decided instead to search for tweet resources using only the phrases "prospective student genetic counseling" and "genetic counseling".
<br>2: Use those hashtags to find posts that contain links or resource advertisements; events and webinars can also be included.
<br>3: Analyze tweets for authenticity and usefulness using follower count, account age, quality of interaction (positive or negative), and more.
<br>4: Create a sorted list of resources with descriptions and helpful information.
<br>5: (TBD) Advertise the list, possibly as a website or a post somewhere.
<br><br>I started by using Selenium ChromeDriver to gather tweet data, but it soon became apparent that this would be a lot of work and would require constant maintenance, since both Chrome and Twitter frequently change their versions and HTML format. I then tried Tweepy, a Python library for the Twitter API, which offered a much simpler way of gathering Twitter data.
<br><br>I then realized that Tweepy can be used to build the dataset of relevant tweets directly, with hashtags acting as a "broad filter" for genetic counseling tweets. The approach is to first search genetic counseling keywords to find the most common hashtags, then use those hashtags to extract relevant tweets into a dataset that can be analyzed.
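
Step 3 of the list above (judging authenticity and usefulness) could be sketched as a simple scoring heuristic. This is only an illustration under stated assumptions, not the project's implementation: the `TweetInfo` fields, the weights, and the caps are all hypothetical, and interaction quality is approximated by likes and retweets alone. It is written for Python 3.7+.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TweetInfo:
    """Hypothetical container for the signals listed in step 3."""
    text: str
    follower_count: int
    account_created: date
    likes: int
    retweets: int


def usefulness_score(tweet: TweetInfo, today: date) -> float:
    """Combine the signals into one score in [0, 1] (weights are assumptions)."""
    # Cap each signal at 1.0 so one very large value cannot dominate the score.
    followers = min(tweet.follower_count / 1000.0, 1.0)
    account_years = (today - tweet.account_created).days / 365.0
    age = min(account_years / 5.0, 1.0)
    engagement = min((tweet.likes + 2 * tweet.retweets) / 50.0, 1.0)
    return 0.4 * followers + 0.3 * age + 0.3 * engagement


def rank_tweets(tweets, today):
    """Return tweets sorted from highest to lowest score."""
    return sorted(tweets, key=lambda t: usefulness_score(t, today), reverse=True)
```

Sorting by a single score keeps step 4 (the sorted resource list) straightforward, although the weights would need tuning against tweets that have been judged by hand.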

### Code

In [36]:
import tweepy
import config
from collections import Counter

In [32]:
# Load the common-word list, skipping any line that contains a digit

filename = "methods/commonWords.txt"

words = []

# Traverse the file line by line; the with-block closes the file automatically
with open(filename, "r") as file:
    for line in file:
        line = line.strip()
        if not any(char.isdigit() for char in line):
            words.append(line)
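
For reference, the 0.05% cutoff described in the Methods could also be computed directly in Python instead of through the web tool. This is a minimal sketch under assumptions: `frequent_words` is a hypothetical helper, the tokenization (lowercased runs of letters) is a simplification, and the sample sentence is invented for illustration. It assumes Python 3, where `/` is true division.

```python
import re
from collections import Counter


def frequent_words(text, min_share=0.0005):
    """Return {word: share} for words making up at least min_share of all words.

    The default min_share of 0.0005 corresponds to a 0.05% cutoff.
    """
    # Simplified tokenization: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {word: n / total for word, n in counts.items() if n / total >= min_share}


# Invented sample; a higher cutoff is used so the filter is visible on short text.
sample = "genetic counseling webinar for prospective genetic counseling students"
shares = frequent_words(sample, min_share=0.2)
print(sorted(shares))  # ['counseling', 'genetic']
```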

Find hashtags:

In [44]:
# Load the Twitter API credentials from the local config module
api_key = config.api_key
api_secrets = config.api_key_secret
access_token = config.access_token
access_secret = config.access_token_secret

# Authenticate to Twitter
auth = tweepy.OAuthHandler(api_key,api_secrets)
auth.set_access_token(access_token,access_secret)
api = tweepy.API(auth)

# Collect the hashtags used in tweets that match a search term
# Note: api.search is the Tweepy 3.x name; Tweepy 4 renamed it to api.search_tweets
def findCommonHashTags(keyWord):
    hashtags = []
    for tweet in tweepy.Cursor(api.search, q=keyWord, count=10).items():
        # Each hashtag entity is a dict; 'text' is the tag without the leading '#'
        for hashtag in tweet.entities.get('hashtags', []):
            hashtags.append(hashtag['text'])
    return hashtags

hashtags = []
searchTermList = ['genetic counseling', 'prospective student genetic counseling']
for searchTerm in searchTermList:
    hashtags += findCommonHashTags(searchTerm)

# Keep the three most frequent hashtags
counter = Counter(hashtags)
freqHashtags = counter.most_common(3)

print(freqHashtags)

[(u'GeneChat', 21), (u'GeneticCounselors', 16), (u'GeneticCounseling', 6)]


The most common hashtags are: "GeneChat", "GeneticCounselors", "GeneticCounseling"
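
Hashtags on Twitter are matched case-insensitively, so `#GeneChat` and `#genechat` refer to the same tag, while `Counter` counts them as separate keys. A minimal sketch of a case-folded count follows; `count_hashtags` is a hypothetical helper and the sample list is made up, not taken from the results above.

```python
from collections import Counter


def count_hashtags(hashtags, top_n=3):
    """Count hashtags ignoring case; report each under its most frequent spelling."""
    folded = Counter(tag.lower() for tag in hashtags)
    # Track how often each original spelling occurs for every folded key.
    spellings = {}
    for tag in hashtags:
        spellings.setdefault(tag.lower(), Counter())[tag] += 1
    return [
        (spellings[key].most_common(1)[0][0], count)
        for key, count in folded.most_common(top_n)
    ]


sample = ["GeneChat", "genechat", "GeneChat", "GeneticCounseling"]
print(count_hashtags(sample))  # [('GeneChat', 3), ('GeneticCounseling', 1)]
```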

In [68]:
# Collect the tweet text and shortened links for tweets that match a search term
def getResources(keyWord):
    links = []
    events = []
    for tweet in tweepy.Cursor(api.search, q=keyWord, count=100).items():
        events.append(tweet.text)  # keep the text to sift through later
        # Cannot get the unwound URL without the enterprise API;
        # can get the HTML title and description
        for link in tweet.entities.get('urls', []):
            links.append(link['url'])
    return {'links': links, 'events': events}

resources = []
searchTermList = ['#GeneChat', '#GeneticCounselors', '#GeneticCounseling']
for searchTerm in searchTermList:
    resources.append(getResources(searchTerm))
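
In the standard v1.1 API payload, each entry of the `urls` entity carries an `expanded_url` field next to the shortened `t.co` `url`. It holds the link as it was originally posted, which may itself be a shortened link, so it is not a substitute for the enterprise-only unwound URL. The sketch below shows one way to prefer it, using plain dictionaries shaped like that payload; `extract_links` is a hypothetical helper and the sample entities are invented.

```python
def extract_links(url_entities):
    """Return one link per entity, preferring 'expanded_url' over the t.co 'url'."""
    links = []
    for entity in url_entities:
        link = entity.get("expanded_url") or entity.get("url")
        if link:
            links.append(link)
    return links


# Invented sample shaped like tweet.entities['urls'] in the v1.1 payload.
sample_entities = [
    {"url": "https://t.co/abc123", "expanded_url": "https://example.org/webinar"},
    {"url": "https://t.co/def456"},
]
print(extract_links(sample_entities))
```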

In [63]:
# Merge the per-hashtag results into flat lists
events = []
links = []
for resource in resources:
    events += resource['events']
    links += resource['links']

In [64]:
links

[u'https://t.co/vN3bGiudz6',
 u'https://t.co/gkBsNd0Cd3',
 u'https://t.co/zRBbPBQmfs',
 u'https://t.co/tdY1fMUNOa',
 u'https://t.co/FOsJD1t6v6',
 u'https://t.co/NueGWo1GNr',
 u'https://t.co/5vLcYUbojq',
 u'https://t.co/yy3tZqUXBI',
 u'https://t.co/koMHvuFnu5',
 u'https://t.co/4CUhqSkjom',
 u'https://t.co/Isco6LE0K4',
 u'https://t.co/sokycybG2e',
 u'https://t.co/rCMMIps0MJ',
 u'https://t.co/JhPoyp0XyM',
 u'https://t.co/1N2XailCdL',
 u'https://t.co/1N2XailCdL',
 u'https://t.co/P61kKUBn45',
 u'https://t.co/WbDQ5SJwPv',
 u'https://t.co/8cDKlFqGLw',
 u'https://t.co/UQHWqGHjMZ',
 u'https://t.co/E70Usdg8Q9',
 u'https://t.co/uZY9fnyI4j',
 u'https://t.co/psRy8fT8px',
 u'https://t.co/a8rDO4sfX1',
 u'https://t.co/BFOMmU6sNm',
 u'https://t.co/evlCslmePW',
 u'https://t.co/dANG92JPk4',
 u'https://t.co/rdt2he7xhM',
 u'https://t.co/7kmz5WWkIn',
 u'https://t.co/tZgnVUjBDm',
 u'https://t.co/13XRzjTf6j',
 u'https://t.co/SD86EkUu9E',
 u'https://t.co/QM8Z2T3O9j',
 u'https://t.co/UuxatsufzZ',
 u'https://t.c
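
The output above contains a repeated link (`https://t.co/1N2XailCdL` appears twice). Repeats can occur, for example, when the same tweet matches more than one of the searched hashtags. One way to drop them while keeping first-seen order is sketched below; `unique_in_order` is a hypothetical helper, and it relies on dictionaries preserving insertion order (Python 3.7+).

```python
def unique_in_order(items):
    """Remove duplicates while preserving the order of first appearance."""
    return list(dict.fromkeys(items))


# A few links copied from the output above, including the repeated one.
sample_links = [
    "https://t.co/JhPoyp0XyM",
    "https://t.co/1N2XailCdL",
    "https://t.co/1N2XailCdL",
    "https://t.co/P61kKUBn45",
]
print(unique_in_order(sample_links))
```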

### Results

### Discussion