# Twitter Data Scrape
This notebook scrapes Twitter user profile data by hashtag and user id.
The idea is to collect the user ids associated with a given hashtag, then use that list of ids to collect information from each user's profile.

In [1]:
# Import libraries
from ntscraper import Nitter
import pandas as pd
from pprint import pprint

12-Sep-24 11:35:59 - Note: NumExpr detected 22 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
12-Sep-24 11:35:59 - NumExpr defaulting to 8 threads.


### Data Scrape using Nitter

In [2]:
# Set up scraper
scraper = Nitter(log_level=1, skip_instance_check=False)

# Get tweets for a given hashtag; the user ids are extracted from these tweets later
twitter_hash_tweets = scraper.get_tweets("PresidentialDebate2024", mode = "hashtag", number = 5)

Testing instances: 100%|██████████| 16/16 [00:11<00:00,  1.36it/s]

12-Sep-24 11:36:11 - No instance specified, using random instance https://nitter.lucabased.xyz





12-Sep-24 11:36:26 - Current stats for PresidentialDebate2024: 5 tweets, 0 threads...


In [3]:
pprint(twitter_hash_tweets)

{'threads': [],
 'tweets': [{'date': 'Sep 12, 2024 · 3:36 PM UTC',
             'external-link': '',
             'gifs': [],
             'is-pinned': False,
             'is-retweet': False,
             'link': 'https://twitter.com/FOmyronpitts/status/1834254661685067865#m',
             'pictures': [],
             'quoted-post': {},
             'replying-to': [],
             'stats': {'comments': 0, 'likes': 0, 'quotes': 0, 'retweets': 0},
             'text': 'This was not the first time that Harris trolled Trump on '
                     'his crowd size ...  Nor was it the first time that a '
                     'strong, female candidate debated Trump and exposed his '
                     'superficial nature.  #PresidentialDebate2024 '
                     '#KamalaHarris #DonaldTrump #HillaryClinton '
                     'https://www.fayobserver.com/story/opinion/2024/09/12/kamala-harris-beats-trump-in-debate-fayetteville-crowd-cheers/75182980007/',
             'user': {'a

### Explore data structure from scrape result

In [4]:
print(type(twitter_hash_tweets))
len(twitter_hash_tweets["tweets"])

<class 'dict'>


5

In [5]:
print(type(twitter_hash_tweets["tweets"]))
print(len(twitter_hash_tweets["tweets"][0]))

<class 'list'>
13


Based on the output above, Nitter returns the scrape result as a Python dictionary, and the value stored under its `tweets` key is a Python list.

In [6]:
# Loop over the first element of the list to see which keys it stores
for i in twitter_hash_tweets['tweets'][0]:
    print(i)
print('----------------------------------------')
print(type(twitter_hash_tweets['tweets'][0]))

link
text
user
date
is-retweet
is-pinned
external-link
replying-to
quoted-post
stats
pictures
videos
gifs
----------------------------------------
<class 'dict'>


In [7]:
print(type(twitter_hash_tweets['tweets'][0]['user']))
print(twitter_hash_tweets['tweets'][0]['user'])

<class 'dict'>
{'name': 'Myron B. Pitts', 'username': '@FOmyronpitts', 'profile_id': '1351937197918793728', 'avatar': 'https://pbs.twimg.com/profile_images/1351937197918793728/ksU1VlAv_bigger.jpg'}


In [8]:
print(twitter_hash_tweets['tweets'][0]['user']['username'])

@FOmyronpitts


Now that we fully understand the structure of the scrape result, let's extract all the user ids.

### Extract user id

In [9]:
# Helper function
def extract_element_from_list(init_list:list, element:str) -> list:
    """ 
    The purpose of the function is to extract a certain element from every dictionary in the initial list.
    Elements included in tweet: (link, text, user, date, is-retweet, is-pinned, external-link, replying-to, quoted-post, stats, pictures, videos, gifs)
    Elements included in user: (name, username, profile_id, avatar)

    init_list - A list of dictionary objects.
    element - A string naming exactly one of the elements listed above.
    result - A list of the extracted elements, in the same order as init_list.
    """
    result = []
    for i in init_list:
        result.append(i[element]) # i must be a dictionary that contains the key
    return result

In [10]:
tweets = twitter_hash_tweets["tweets"]
text_of_tweets = extract_element_from_list(tweets, 'text')
users = extract_element_from_list(tweets, 'user')
ids = extract_element_from_list(users, 'username')
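
As a hedged alternative to the loop-based helper, the same extraction can be written as a list comprehension. The sketch below uses a hypothetical `pluck` function and hand-written sample data (not real scraper output). Unlike the helper above, it returns a default value instead of raising `KeyError` when a key is missing.

```python
def pluck(records, key, default=None):
    """Collect record[key] for every dict in records, preserving order."""
    return [record.get(key, default) for record in records]

# Hand-written sample shaped like the 'tweets' list (illustrative values only).
sample_tweets = [
    {"text": "first tweet", "user": {"username": "@alice"}},
    {"text": "second tweet", "user": {"username": "@bob"}},
]

sample_users = pluck(sample_tweets, "user")
sample_ids = pluck(sample_users, "username")
print(sample_ids)
```

Returning a default keeps the output list the same length as the input, which matters later when ids and tweet texts are matched by position.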

In [11]:
# The with-block closes the file even if a write fails
with open('ids.txt', 'w') as f:
    for i in ids:
        f.write(f'{i}\n')

### Extract user profile

In [19]:
# Helper function
def remove_character(init_list:list, character:str) -> list:
    """ 
    The purpose is to remove the first appearance of a character from every element of init_list.

    init_list - a list of strings
    character - string value
    result - a list of strings, same length and order as init_list, with the character removed.
    """
    result = []
    for i in init_list:
        # replace() leaves strings without the character unchanged, so no element
        # is dropped and the result stays aligned with the list of tweet texts
        result.append(i.replace(character, '', 1)) # Only remove the first appearance
    
    return result

def extract_profile(list_of_ids:list, text_of_tweets:list) -> pd.DataFrame:
    """ 
    The purpose is to extract user profile information for every id in list_of_ids and add the text of the matching tweet to the dataframe as well. 
    
    list_of_ids - a list of ids
    text_of_tweets - a list of tweet texts that corresponds to the ids in order.
    result - a pandas dataframe with one row per successfully scraped profile
    """
    scraper = Nitter(log_level=0, skip_instance_check=False)
    result_list = []
    # zip keeps each id paired with its tweet text, even when a profile is skipped
    for user_id, text in zip(list_of_ids, text_of_tweets):
        try:
            scrape = scraper.get_profile_info(user_id, mode='detail')
        except ValueError:
            continue # Might have a problem finding an instance
        if scrape is None:
            continue # Skip all None data to avoid errors
        stats = scrape['stats']
        result_list.append({
            'id': scrape['id'],
            'username': scrape['username'],
            'bio': scrape['bio'],
            'joined_date': scrape['joined'],
            'location': scrape['location'],
            'name': scrape['name'],
            'num_follower': stats['followers'],
            'num_following': stats['following'],
            'likes': stats['likes'],
            'media': stats['media'],
            'num_tweets': stats['tweets'],
            'website': scrape['website'],
            'text': text
        })
    
    return pd.DataFrame(result_list)

def check_exist_ids(list_of_ids:list, df:pd.DataFrame) -> tuple:
    """ 
    The purpose of the function is to drop ids that were already scraped from the list of ids that still need to be scraped. 

    list_of_ids - a list of ids
    df - a pandas dataframe that stores all the records of scraped data
    result - a tuple with the list of ids that still need to be scraped and the list of positions (in list_of_ids) of the dropped ids, ordered from highest to lowest so the matching tweet texts can be popped one by one without shifting the remaining positions.
    """
    old_ids = set(df.loc[:, 'username'])
    remaining = []
    index = []
    # Build a new list instead of calling remove() while iterating, which skips elements
    for position, user_id in enumerate(list_of_ids):
        if user_id in old_ids:
            index.append(position)
        else:
            remaining.append(user_id)

    return (remaining, index[::-1])
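
When the character to remove is always a leading `@`, a narrower helper than `remove_character` may be simpler. This is a minimal sketch, assuming a hypothetical `strip_at` function that is not part of the notebook. Names without a leading `@` are passed through unchanged, so the output always has the same length as the input.

```python
def strip_at(usernames):
    """Remove one leading '@' from each username, keeping every entry."""
    return [name[1:] if name.startswith("@") else name for name in usernames]

# One name with the prefix and one without, to show both branches.
print(strip_at(["@FOmyronpitts", "janninereid1"]))
```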

In [20]:
old_df = pd.read_csv('users.csv')
need_ids = check_exist_ids(ids, old_df)
actual_ids = remove_character(need_ids[0], '@')

# Pop from the highest index down so earlier removals do not shift later positions
for i in sorted(need_ids[1], reverse=True):
    text_of_tweets.pop(i)

print(actual_ids)
print(text_of_tweets)

['FOmyronpitts', 'janninereid1', 'autumnsdad1']
['This was not the first time that Harris trolled Trump on his crowd size ...  Nor was it the first time that a strong, female candidate debated Trump and exposed his superficial nature.  #PresidentialDebate2024 #KamalaHarris #DonaldTrump #HillaryClinton https://www.fayobserver.com/story/opinion/2024/09/12/kamala-harris-beats-trump-in-debate-fayetteville-crowd-cheers/75182980007/', 'People who AREN\'T on X or aren\'t following politics believe the stuff these " news" channels are saying.  Below is an example of what many of those Americans would NEVER know about.⬇️  #PresidentialDebate2024 #cbsnews #MagaMemeQueen 👑', 'Even Robert Kennedy says Harris beat Trump in the debate..love his honesty and wonder if he has second thoughts on supporting Trump after he failed to even bring up MAHA .. a huge thing Kennedy believes in..I’m voting for RFK.. #PresidentialDebate2024']
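
Filtering the ids and deleting tweet texts by position are two separate steps that have to stay in sync. A hedged alternative is to filter both lists in a single pass. The sketch below assumes a hypothetical `filter_new` function and made-up sample values. It is not the notebook's method, only one way to keep each id paired with its text.

```python
def filter_new(ids, texts, known_ids):
    """Return (ids, texts) for entries whose id is not in known_ids, order preserved."""
    known = set(known_ids)
    kept = [(i, t) for i, t in zip(ids, texts) if i not in known]
    new_ids = [i for i, _ in kept]
    new_texts = [t for _, t in kept]
    return new_ids, new_texts

# Made-up sample data: '@b' and '@c' stand for users that were already scraped.
sample_ids = ["@a", "@b", "@c", "@d"]
sample_texts = ["text a", "text b", "text c", "text d"]
new_ids, new_texts = filter_new(sample_ids, sample_texts, known_ids=["@b", "@c"])
print(new_ids, new_texts)
```

Because ids and texts travel together as pairs, no index bookkeeping is needed and neither input list is modified.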


In [24]:
df = extract_profile(actual_ids, text_of_tweets)

Testing instances: 100%|██████████| 16/16 [00:10<00:00,  1.49it/s]


12-Sep-24 13:35:57 - Empty page on https://nitter.lucabased.xyz
12-Sep-24 13:41:48 - Empty page on https://nitter.lucabased.xyz
12-Sep-24 13:41:51 - Fetching error: User "janninereid1" not found
12-Sep-24 13:41:55 - Fetching error: Instance has been rate limited.Use another instance or try again later.
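
The log above shows empty pages and a rate-limited instance, which is why only some profiles come back. One common mitigation is to retry with an increasing delay. The sketch below is an illustration under stated assumptions: `fetch_with_retry` and `flaky_fetch` are hypothetical names, the fetcher is a stand-in rather than a real Nitter call, and whether retrying helps depends on the instance.

```python
import time

def fetch_with_retry(fetch, username, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(username) until it returns a non-None value or attempts run out."""
    for attempt in range(attempts):
        try:
            result = fetch(username)
        except ValueError:
            result = None  # treat a failed lookup like an empty result
        if result is not None:
            return result
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # wait 1x, 2x, 4x, ... base_delay
    return None

# Stand-in fetcher that returns None twice, then succeeds (demonstration only).
calls = {"count": 0}

def flaky_fetch(username):
    calls["count"] += 1
    if calls["count"] < 3:
        return None
    return {"username": username}

delays = []  # record the requested delays instead of actually sleeping
profile = fetch_with_retry(flaky_fetch, "FOmyronpitts", sleep=delays.append)
print(profile, delays)
```

In the notebook, `fetch` could be a small wrapper around the `scraper.get_profile_info(..., mode='detail')` call used in `extract_profile`. The `sleep` parameter exists only so the delay can be replaced during testing.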


In [25]:
display(df)

Unnamed: 0,id,username,bio,joined_date,location,name,num_follower,num_following,likes,media,num_tweets,website,text
0,19027021,@FOmyronpitts,"Father, Husband. Opinion Editor at @fayobserve...",4:16 PM - 15 Jan 2009,"Fayetteville, NC",Myron B. Pitts,4634,4269,12871,0,40852,https://www.fayobserver.com/staff/5491811002/m...,This was not the first time that Harris trolle...


### Update Dataframe

In [29]:
new_df = pd.concat([df, old_df], ignore_index=True, sort=False)
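# index=False avoids writing the row index, which would be read back as an extra 'Unnamed: 0' column
new_df.to_csv('users.csv', index=False)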