<a href="https://colab.research.google.com/github/SocialMediaLab/TweetsSamplingToolkit/blob/main/demo_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import gzip
import zlib
import shutil
import pandas as pd
import urllib.request
from io import StringIO

In [2]:
# Get the external library for working with Tweet IDs
!git clone https://github.com/SocialMediaLab/TweetsSamplingToolkit.git

Cloning into 'TweetsSamplingToolkit'...
remote: Enumerating objects: 36, done.
remote: Counting objects: 100% (36/36), done.
remote: Compressing objects: 100% (36/36), done.
remote: Total 36 (delta 17), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (36/36), done.


In [4]:
# Navigate to the working directory created by the clone
%cd TweetsSamplingToolkit

/content/TweetsSamplingToolkit


In [5]:
# Install the package requirements (optional step)
#!pip install -r requirements.txt

In [6]:
# Import the Tweets Sampling Toolkit module used in the cells below
import tweet_id_subsets

In [8]:
# Load and unpack an external gzip'ed CSV file with Tweet IDs
gz_file = "https://figshare.com/ndownloader/files/31249681"
data = urllib.request.urlopen(gz_file)
data_obj = data.read()
bytes_data = gzip.decompress(data_obj)

# Decode the decompressed bytes and parse them into a DataFrame
s = str(bytes_data, 'utf-8')
csv_data = StringIO(s)
df = pd.read_csv(csv_data)
# Check that the Tweet IDs are safely stored in the DataFrame
df.head()

Unnamed: 0,_id
0,1432855497170849793
1,1432855499712696322
2,1432855500056453124
3,1432855504041164801
4,1432855504225640453
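
The decompress, decode, and parse steps above can be tried offline with a small stand-in payload. The sketch below is only an illustration: it builds a hypothetical two-row gzip'ed CSV in memory instead of downloading the real file, and it uses the standard-library csv module rather than pandas. The blank first header mirrors the column that pandas displays as "Unnamed: 0". It also shows why 19-digit Tweet IDs should stay integers or strings: a float cannot represent them exactly.

```python
import csv
import gzip
import io

# Hypothetical stand-in for the downloaded bytes: a tiny gzip'ed CSV built in memory.
rows = [("", "_id"), ("0", "1432855497170849793"), ("1", "1432855499712696322")]
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)
payload = gzip.compress(buf.getvalue().encode("utf-8"))

# Same three steps as the cell above: decompress, decode, parse.
text = gzip.decompress(payload).decode("utf-8")
reader = csv.DictReader(io.StringIO(text))
tweet_ids = [int(row["_id"]) for row in reader]

# Converting an ID to float and back changes its value, so avoid float columns.
lossy = int(float(tweet_ids[0]))
print(tweet_ids[0] == lossy)  # False
```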


In [9]:
# Save the input data as a local CSV file for processing with the Toolkit
# NOTE: if you are running this notebook in Google Colab, the free storage is 100 GB
out_file = "sample1.csv"
df.to_csv(out_file, index=False)

In [10]:
# Read Tweet IDs from a local CSV file 
ifm = tweet_id_subsets.id_file_manager(out_file)
ifm.id_count

760413

In [11]:
# Create a sample containing 20% of our original file
percent_sample = ifm.get_random_sample(
    0.2,
    'percent_sample.csv', 
    sample_mode='percent'
)
percent_sample.id_count

Generating random sample


100%|██████████| 152082/152082 [00:01<00:00, 126438.77it/s]


152082
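
For intuition about the sample size reported above, here is a minimal sketch of percent-based sampling without replacement. It is not the Toolkit's implementation: percent_sample is a hypothetical helper, the IDs are placeholders, and rounding the sample size down is an assumption that happens to agree with the count of 152082 shown above.

```python
import random

def percent_sample(ids, fraction, seed=None):
    """Return a random sample holding `fraction` of `ids`, without replacement."""
    rng = random.Random(seed)
    k = int(len(ids) * fraction)  # assumption: round down to a whole number of IDs
    return rng.sample(ids, k)

ids = list(range(760413))  # placeholder IDs, same count as the file above
sample = percent_sample(ids, 0.2, seed=42)
print(len(sample))  # 152082
```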

In [12]:
# Split the large file into 5 subsets
pages = ifm.get_page_samples(5, 'pages.csv')

for page in pages:
    print(f'{page.file_name}: {page.id_count}')

pages_0.csv: 152082
pages_1.csv: 152082
pages_2.csv: 152082
pages_3.csv: 152082
pages_4.csv: 152085
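
The page sizes above are consistent with giving every page len(ids) // 5 IDs and leaving the remainder in the last page. The sketch below reproduces that arithmetic under this assumption; split_pages is a hypothetical helper working on an in-memory list, and the Toolkit's actual file-based logic may differ.

```python
def split_pages(ids, n_pages):
    """Split `ids` into `n_pages` consecutive pages; the last page absorbs the remainder."""
    size = len(ids) // n_pages
    pages = [ids[i * size:(i + 1) * size] for i in range(n_pages - 1)]
    pages.append(ids[(n_pages - 1) * size:])
    return pages

ids = list(range(760413))  # placeholder IDs, same count as the file above
pages = split_pages(ids, 5)
print([len(page) for page in pages])  # [152082, 152082, 152082, 152082, 152085]
```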


In [13]:
# Comparing files: draw two independent 20% samples of the original file
a = ifm.get_random_sample(0.2, 'percent_sample1.csv', sample_mode='percent')
b = ifm.get_random_sample(0.2, 'percent_sample2.csv', sample_mode='percent')



Generating random sample


100%|██████████| 152082/152082 [00:01<00:00, 125595.79it/s]


Generating random sample


100%|██████████| 152082/152082 [00:01<00:00, 127503.87it/s]
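
Since a and b are drawn independently from the same file, some overlap between them is expected. Assuming each sample is a uniform draw of k IDs out of N without replacement, each ID in a appears in b with probability k/N, so the expected overlap is k * k / N. The short calculation below applies that formula to the counts in this notebook; the intersection cell that follows reports 30439, which is close to this estimate.

```python
n_total = 760413             # IDs in the original file
k = int(n_total * 0.2)       # size of each 20% sample (152082)
expected_overlap = k * k / n_total
print(round(expected_overlap))  # 30416
```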


In [14]:
# Get all of the Tweet IDs that are in both a and b
intersection = a.get_intersection(b, 'intersection.csv')
intersection.id_count

Splitting file


1000000it [00:00, 1977200.21it/s]


Sorting IDs


100%|██████████| 1/1 [00:00<00:00, 2439.97it/s]


Merging IDs


100%|██████████| 152082/152082 [00:00<00:00, 373120.67it/s]
100%|██████████| 152082/152082 [00:09<00:00, 16648.70it/s]


30439
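
The log messages above ("Splitting file", "Sorting IDs", "Merging IDs") suggest a sort-and-merge strategy, although the Toolkit's code is not shown here. As a rough illustration of that general technique, and not of the Toolkit's internals, the sketch below intersects two ascending, duplicate-free sequences with a two-pointer merge. sorted_intersection is a hypothetical helper, and the IDs are toy values.

```python
def sorted_intersection(a_sorted, b_sorted):
    """Yield values present in both ascending, duplicate-free sequences."""
    it_a, it_b = iter(a_sorted), iter(b_sorted)
    x, y = next(it_a, None), next(it_b, None)
    while x is not None and y is not None:
        if x == y:
            yield x
            x, y = next(it_a, None), next(it_b, None)
        elif x < y:
            x = next(it_a, None)  # advance the side holding the smaller value
        else:
            y = next(it_b, None)

a_ids = sorted([5, 1, 9, 3])
b_ids = sorted([3, 4, 5, 10])
common = list(sorted_intersection(a_ids, b_ids))
print(common)  # [3, 5]
```

Because each input is read once in order, this approach only needs the current element of each sequence in memory, which is what makes it suitable for large ID files.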

In [15]:
# Difference
# Get all of the IDs that are in a but not in b
difference = a.get_difference(b, 'difference.csv')
difference.id_count

Splitting file


1000000it [00:00, 1883751.24it/s]


Sorting IDs


100%|██████████| 1/1 [00:00<00:00, 3243.85it/s]


Merging IDs


100%|██████████| 152082/152082 [00:00<00:00, 345965.93it/s]
100%|██████████| 152082/152082 [00:09<00:00, 16433.54it/s]


121643
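
A quick sanity check on the difference: every ID in a is either shared with b or unique to a, so the intersection and difference counts should add up to the size of a. The sketch below shows the identity on toy Python sets (not the Toolkit's file-based objects) and then applies it to the counts reported above.

```python
a_ids = {1, 3, 5, 9}
b_ids = {3, 4, 5, 10}

only_in_a = a_ids - b_ids  # IDs in a but not in b
# Identity: |a| = |a ∩ b| + |a − b|
identity_holds = len(a_ids) == len(a_ids & b_ids) + len(only_in_a)
print(sorted(only_in_a), identity_holds)  # [1, 9] True

# The counts reported above satisfy the same identity.
print(30439 + 121643 == 152082)  # True
```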

In [None]:
# Union
union = a.get_union(b, 'union.csv')
union.id_count

Creating union of the two files in union.csv
Splitting file


1000000it [00:00, 2025857.03it/s]


Sorting IDs


100%|██████████| 1/1 [00:00<00:00, 3830.41it/s]


Merging IDs


100%|██████████| 152082/152082 [00:00<00:00, 344477.23it/s]


Writing results from percent_sample1.csv


100%|██████████| 152082/152082 [00:00<00:00, 515897.32it/s]


Writing results from percent_sample2.csv


100%|██████████| 152082/152082 [00:09<00:00, 16067.28it/s]


273767
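
As with the intersection, a union of two sorted ID sequences can be produced in a single merge pass. The sketch below uses heapq.merge from the standard library and drops consecutive duplicates. It illustrates the general idea only; sorted_union is a hypothetical helper and is not taken from the Toolkit.

```python
import heapq

def sorted_union(a_sorted, b_sorted):
    """Yield each value from two ascending sequences once, in ascending order."""
    previous = None
    for value in heapq.merge(a_sorted, b_sorted):
        if value != previous:  # skip IDs already emitted
            yield value
            previous = value

merged = list(sorted_union([1, 3, 5, 9], [3, 4, 5, 10]))
print(merged)  # [1, 3, 4, 5, 9, 10]
```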