
In [None]:
import gzip, zlib
import shutil
import urllib.request
from io import StringIO

In [None]:
# Get the external library for working with tweet IDs
!git clone https://github.com/SocialMediaLab/Tweets_Sampling_Toolkit.git

Cloning into 'Tweets_Sampling_Toolkit'...
remote: Enumerating objects: 65, done.
remote: Counting objects: 100% (65/65), done.
remote: Compressing objects: 100% (64/64), done.
remote: Total 65 (delta 33), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (65/65), done.


In [None]:
# navigate to the working directory
%cd Tweets_Sampling_Toolkit

/content/Tweets_Sampling_Toolkit


In [None]:
# Install the package requirements (this step is optional)
!pip install -r requirements.txt



In [None]:
# Import Tweets Sampling Library 
import tweets_sampling

In [None]:
# load and unpack an external gzip'ed CSV file with Tweet IDs
gz_file = "https://figshare.com/ndownloader/files/31249681"

# Download the compressed bytes and decompress them in memory
with urllib.request.urlopen(gz_file) as data:
    data_obj = data.read()
bytes_data = gzip.decompress(data_obj)

# Write the decompressed CSV to disk
with open('sample1.csv', 'wb') as f:
    f.write(bytes_data)
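
The cell above holds both the compressed and the decompressed bytes in memory. For larger ID files, a streaming variant may be preferable. The following is a minimal sketch using only the standard library; `gunzip_stream` and `stream_demo.csv` are hypothetical names chosen for illustration and are not part of the toolkit.

```python
import gzip
import io
import shutil


def gunzip_stream(compressed_stream, out_path, chunk_size=1024 * 1024):
    """Decompress a gzip stream to out_path, one chunk at a time."""
    with gzip.GzipFile(fileobj=compressed_stream) as gz, open(out_path, "wb") as out:
        shutil.copyfileobj(gz, out, chunk_size)


# Demonstration with an in-memory stream standing in for the download.
# An object returned by urllib.request.urlopen(...) could be passed the same way,
# since it also exposes read().
payload = b"1234567890123456789\n" * 3
gunzip_stream(io.BytesIO(gzip.compress(payload)), "stream_demo.csv")
```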

In [None]:
# Load a local CSV file
# The file is not read in full; instead, the number of IDs is
# calculated from the file's size
ifm = tweets_sampling.id_file_manager('sample1.csv')
ifm.id_count

New file: 760413 IDs (sample1.csv)


760413
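
The comment in the cell above says the file is not read in full. One way such a count could be obtained is by dividing the file size by the width of a single line. The sketch below illustrates that idea only; it assumes every ID occupies a line of identical byte width, which may not hold for real tweet ID files, and it is not taken from the toolkit's source. `estimate_id_count` and `fixed_width_ids.csv` are hypothetical names.

```python
import os


def estimate_id_count(path):
    """Estimate the number of lines from the file size (assumes equal-width lines)."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        first_line = f.readline()  # only the first line is read
    if not first_line:
        return 0
    width = len(first_line)
    # Ceiling division tolerates a missing newline on the last line
    return -(-size // width)


# Small demonstration file with five 19-digit IDs
with open("fixed_width_ids.csv", "w", newline="\n") as f:
    for i in range(5):
        f.write(f"{1450000000000000000 + i}\n")

print(estimate_id_count("fixed_width_ids.csv"))
```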

In [None]:
# Create a sample containing 20% of our original file
# Alternatively, you can use sample_mode="absolute" to create a
# sample with (for example) 3000 IDs
percent_sample = ifm.get_random_sample(
    0.2,
    'percent_sample.csv', 
    sample_mode='percent'
)
percent_sample.id_count

Generating random sample


100%|██████████| 152082/152082 [00:00<00:00, 155157.44it/s]

New file: 152082 IDs (percent_sample.csv)





152082
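
To make the two sample modes concrete, here is a small in-memory sketch. It assumes that a percent sample truncates `total * fraction` to an integer, which is consistent with the 152082 IDs reported above for 20% of 760413, but the toolkit's actual rounding rule is not confirmed here. `sample_size` and `random_sample` are hypothetical helpers, not toolkit functions.

```python
import random


def sample_size(total, amount, sample_mode):
    """Return how many IDs to draw for the given mode."""
    if sample_mode == "percent":
        return int(total * amount)  # assumption: truncate toward zero
    if sample_mode == "absolute":
        return min(int(amount), total)
    raise ValueError(f"unknown sample_mode: {sample_mode}")


def random_sample(ids, amount, sample_mode="percent"):
    """Draw a random sample without replacement from a list of IDs."""
    return random.sample(ids, sample_size(len(ids), amount, sample_mode))


demo_ids = list(range(10))
demo = random_sample(demo_ids, 0.2)  # two IDs out of ten
print(sample_size(760413, 0.2, "percent"))
```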

In [None]:
# Split the large file into 5 subsets
# Each ID from the original file will be present in exactly one of
# the five files
pages = ifm.get_page_samples(5, 'pages.csv')

# Print each resulting file's name and ID count
for page in pages:
    print(f'{page.file_name}: {page.id_count}')

Generating page 1 of 5


100%|██████████| 152082/152082 [00:00<00:00, 610806.58it/s]


New file: 152082 IDs (pages_0.csv)
Generating page 2 of 5


100%|██████████| 152082/152082 [00:00<00:00, 608373.25it/s]


New file: 152082 IDs (pages_1.csv)
Generating page 3 of 5


100%|██████████| 152082/152082 [00:00<00:00, 574566.13it/s]


New file: 152082 IDs (pages_2.csv)
Generating page 4 of 5


100%|██████████| 152082/152082 [00:00<00:00, 619503.66it/s]


New file: 152082 IDs (pages_3.csv)
Generating page 5 of 5


100%|██████████| 152085/152085 [00:00<00:00, 563545.54it/s]

New file: 152085 IDs (pages_4.csv)
pages_0.csv: 152082
pages_1.csv: 152082
pages_2.csv: 152082
pages_3.csv: 152082
pages_4.csv: 152085
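
The page sizes in the output above follow from integer division: 760413 IDs over 5 pages gives 152082 per page with 3 left over, and the last page absorbs the remainder. The sketch below reproduces that arithmetic; `page_sizes` is a hypothetical helper written to match the printed counts, not the toolkit's own code.

```python
def page_sizes(total, pages):
    """Split total into equal pages, adding any remainder to the last page."""
    base = total // pages
    sizes = [base] * pages
    sizes[-1] += total % pages
    return sizes


sizes = page_sizes(760413, 5)
print(sizes)
```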





In [None]:
# Comparing Files
# We will create two samples to compare to each other

a = ifm.get_random_sample(0.2, 'percent_sample1.csv', sample_mode='percent')
b = ifm.get_random_sample(0.2, 'percent_sample2.csv', sample_mode='percent')


Generating random sample


100%|██████████| 152082/152082 [00:00<00:00, 156903.57it/s]


New file: 152082 IDs (percent_sample1.csv)
Generating random sample


100%|██████████| 152082/152082 [00:00<00:00, 155565.69it/s]

New file: 152082 IDs (percent_sample2.csv)





In [None]:
# Get all of the tweet IDs that are in both a and b
# One of the files will be automatically sorted so that a
# binary search can be used to check for overlap

intersection = a.get_intersection(b, 'intersection.csv')
intersection.id_count

Sorting file (Step 1)


1000000it [00:00, 2589661.61it/s]


Sorting file (Step 2)


100%|██████████| 1/1 [00:00<00:00, 5140.08it/s]


Sorting file (Step 3)


100%|██████████| 152082/152082 [00:00<00:00, 423292.81it/s]


Checking files for overlap (Final Step)


100%|██████████| 152082/152082 [00:07<00:00, 19788.94it/s]

New file: 30434 IDs (intersection.csv)
30434 IDs were found in both percent_sample2.csv and percent_sample1.csv.





30434
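
The sort-then-binary-search approach described in the cell's comments can be illustrated in memory with the standard `bisect` module. This is a sketch of the general technique, not the toolkit's file-based implementation; `intersect_with_binary_search` is a hypothetical name.

```python
from bisect import bisect_left


def intersect_with_binary_search(ids_a, ids_b):
    """Return the IDs of ids_a that also occur in ids_b, in ids_a order."""
    sorted_b = sorted(ids_b)  # sort one side once
    common = []
    for tweet_id in ids_a:
        pos = bisect_left(sorted_b, tweet_id)  # O(log n) lookup
        if pos < len(sorted_b) and sorted_b[pos] == tweet_id:
            common.append(tweet_id)
    return common


result = intersect_with_binary_search([5, 1, 3, 9], [3, 4, 5, 6])
print(result)
```

As a rough sanity check, if the two 20% samples are drawn independently and uniformly, about 0.2 × 0.2 = 4% of the 760413 original IDs (roughly 30,416) would be expected in both, which is close to the 30434 reported above.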

In [None]:
# Difference
# Get all of the IDs that are in a, but not in b
difference = a.get_difference(b, 'difference.csv')
difference.id_count

Generating difference file
Sorting file (Step 1)


1000000it [00:00, 2472619.23it/s]


Sorting file (Step 2)


100%|██████████| 1/1 [00:00<00:00, 4782.56it/s]


Sorting file (Step 3)


100%|██████████| 152082/152082 [00:00<00:00, 405999.08it/s]


Writing results from percent_sample1.csv


100%|██████████| 152082/152082 [00:07<00:00, 19148.86it/s]

New file: 121648 IDs (difference.csv)
30434 IDs were found in both percent_sample1.csv and percent_sample2.csv.





121648

In [None]:
# Get all of the IDs that are in either a or b
union = a.get_union(b, 'union.csv')
union.id_count

Creating union file
Sorting file (Step 1)


1000000it [00:00, 2491546.35it/s]


Sorting file (Step 2)


100%|██████████| 1/1 [00:00<00:00, 4675.92it/s]


Sorting file (Step 3)


100%|██████████| 152082/152082 [00:00<00:00, 399358.24it/s]


Writing results from percent_sample1.csv


100%|██████████| 152082/152082 [00:00<00:00, 582171.33it/s]


Writing results from percent_sample2.csv


100%|██████████| 152082/152082 [00:07<00:00, 19404.96it/s]

New file: 273730 IDs (union.csv)
30434 IDs were found in both percent_sample1.csv and percent_sample2.csv.





273730
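
The three counts reported above are mutually consistent under basic set arithmetic: the difference is the size of a minus the overlap, and the union is the sum of both sizes minus the overlap. The check below uses only the numbers printed in this notebook, plus a tiny example with Python sets to show the same identities.

```python
# Counts taken from the notebook output above
size_a = 152082
size_b = 152082
overlap = 30434

difference_count = size_a - overlap       # IDs in a but not in b
union_count = size_a + size_b - overlap   # inclusion-exclusion

print(difference_count, union_count)

# The same identities on a tiny example with Python sets
a_ids = {1, 2, 3, 4}
b_ids = {3, 4, 5}
small_difference = a_ids - b_ids
small_union = a_ids | b_ids
```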