Flatten AS Sets & getting sizes and other stats #114

Closed
2 of 3 tasks
SichangHe opened this issue Dec 24, 2023 · 21 comments
Labels
deferred: Maybe work on this later
lateststats: Latest and up-to-date statistics
noted: Noted this in the writing

Comments

@SichangHe
Owner

SichangHe commented Dec 24, 2023

Flattened 71346 AS Sets in 471ms.

Commit: a6cfea2

as_set_sizes.csv.gz

TODO:

  • Fit Zipf distribution on sizes.
    • Goodness of fit.
  • [x] Fit Zipf distribution on depths.
  • Count cycles.
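
For readers unfamiliar with the term, "flattening" an AS Set here means collecting every AS number reachable through nested sets. The following is only a minimal sketch of that idea, not the script from the commit above: the `as_sets` mapping, the `flatten` helper, and the toy data are all hypothetical, and the real tool may handle cycles and missing sets differently.

```python
def flatten(name, as_sets):
    """Return every AS number reachable from `name`, following nested sets.

    `as_sets` maps a set name to (AS numbers, nested set names).
    This layout is an assumption made for illustration.
    """
    seen_sets = set()
    as_nums = set()
    stack = [name]
    while stack:
        current = stack.pop()
        if current in seen_sets:
            continue  # already expanded; this also breaks cycles
        seen_sets.add(current)
        # Sets that are referenced but never defined contribute nothing.
        nums, nested = as_sets.get(current, (set(), set()))
        as_nums |= nums
        stack.extend(nested)
    return as_nums


# Hypothetical toy data, not taken from the IRR dumps.
AS_SETS = {
    "AS-A": ({1, 2}, {"AS-B"}),
    "AS-B": ({3}, {"AS-A", "AS-C"}),  # refers back to AS-A (a cycle)
    "AS-C": (set(), set()),           # an empty set
}

sizes = {name: len(flatten(name, AS_SETS)) for name in AS_SETS}
print(sizes)
```

The "size" column in the CSV would then correspond to `len(flatten(...))` under these assumptions.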
@SichangHe SichangHe added the lateststats Latest and up-to-date statistics label Dec 24, 2023
@SichangHe
Owner Author

Overview of AS Set sizes (including pseudo sets).

In [2]: df = pd.read_csv('as_set_sizes.csv.gz')

In [3]: df
Out[3]:
                        as_set  size
0            AS-399760-CLIENTS     1
1                   AS-ISIONUK     2
2                     c#203447     2
3                  AS-NICPROXY     7
4           AS199221:AS-199221     2
...                        ...   ...
71341                 AS-BITON     1
71342                 AS-30998     8
71343    AS268235:AS-CUSTOMERS    42
71344                    AS-V6    16
71345  AS-SET-LEVEL-7-INTERNET     5

[71346 rows x 2 columns]

In [4]: df.describe()
Out[4]:
               size
count  71346.000000
mean     439.580089
std     4461.051650
min        0.000000
25%        1.000000
50%        1.000000
75%        5.000000
max    93709.000000

@SichangHe
Owner Author

Stats for pseudo sets and real sets:

In [5]: df_w_hash = df[df['as_set'].str.contains('#')]

In [6]: df_w_hash
Out[6]:
                                    as_set  size
2                                 c#203447     2
5                                  c#22987     5
6                                 c#395466     1
8                                  c#61605     2
9                m#AS-CBL-TRANSIT#MNT-BAHA     1
...                                    ...   ...
71309  m#AS-IXBR-TRANSIT4-SP#MAINT-AS28186     1
71318                              c#59239    58
71326                             c#212271     2
71337                               c#7795    76
71339                              c#46393     1

[17970 rows x 2 columns]

In [7]: df_wo_hash = df[~df['as_set'].str.contains('#')]

In [8]: df_wo_hash
Out[8]:
                        as_set  size
0            AS-399760-CLIENTS     1
1                   AS-ISIONUK     2
3                  AS-NICPROXY     7
4           AS199221:AS-199221     2
7         AS28855:AS-CUSTOMERS     1
...                        ...   ...
71341                 AS-BITON     1
71342                 AS-30998     8
71343    AS268235:AS-CUSTOMERS    42
71344                    AS-V6    16
71345  AS-SET-LEVEL-7-INTERNET     5

[53376 rows x 2 columns]

In [9]: df_wo_hash.describe()
Out[9]:
               size
count  53376.000000
mean     584.558847
std     5149.282015
min        0.000000
25%        1.000000
50%        2.000000
75%        6.000000
max    93709.000000

@SichangHe
Owner Author

Sketch histogram.

image

Code used (generated).
import matplotlib.pyplot as plt

# Plotting histogram
plt.hist(df_wo_hash['size'], bins=1000, edgecolor='black')
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Size')
plt.ylabel('Frequency')
plt.title('Histogram of Sizes in df_wo_hash')
plt.grid(True)
plt.show()
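
A note on the binning: with 1000 equal-width bins and a maximum size of 93709, each bin is roughly 94 wide, so nearly all small sets land in the first bin once the x-axis is logarithmic. One alternative, sketched below with a hypothetical helper name and toy input, is to count exact frequencies per size and plot those pairs on log-log axes instead.

```python
from collections import Counter


def size_frequencies(sizes):
    """Return sorted (size, count) pairs, skipping zeros that a log axis cannot show."""
    counts = Counter(s for s in sizes if s > 0)
    return sorted(counts.items())


# Toy input for illustration only.
pairs = size_frequencies([1, 1, 2, 1, 5, 0, 2])
print(pairs)
```

With the real data, `df_wo_hash['size']` could be passed in and the resulting pairs drawn as a scatter plot on log-log axes.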

@SichangHe
Owner Author

AS Sets (86MiB): as_sets.txt.gz

@SichangHe
Owner Author

SichangHe commented Jan 8, 2024

It seems that the Zipf distribution fitting failed!?

IPython history.
In [1]: import pandas as pd
In [2]: df = pd.read_csv('as_set_sizes.csv.gz')

In [3]: df
Out[3]:
                        as_set  size
0            AS-399760-CLIENTS     1
1                   AS-ISIONUK     2
2                     c#203447     2
3                  AS-NICPROXY     7
4           AS199221:AS-199221     2
...                        ...   ...
71341                 AS-BITON     1
71342                 AS-30998     8
71343    AS268235:AS-CUSTOMERS    42
71344                    AS-V6    16
71345  AS-SET-LEVEL-7-INTERNET     5

[71346 rows x 2 columns]

In [5]: from scipy.stats import zipf, fit

In [8]: res = fit(zipf, df["size"], [(1.0, 10.0)])

In [9]: res
Out[9]:
  params: FitParams(a=1.5713559159360042, loc=0.0)
 success: False
 message: 'Optimization converged to parameter values that are inconsistent with the data.'
PMF Plotting.

image

Commit: SichangHe/internet_route_verification_meta@d912451

@cunha
Collaborator

cunha commented Jan 8, 2024 via email

@SichangHe
Owner Author

… Not a Zipf.

@SichangHe SichangHe added Stats Statistics record and removed lateststats Latest and up-to-date statistics labels Jan 10, 2024
@SichangHe
Owner Author

as_set_sizes1.csv.gz

IPython history: #121

@SichangHe
Owner Author

Previous results were wrong because of a bug in the script (fixed in e29d04d).

Results from corrected script, with depth: as_set_sizes2.csv.gz

In [79]: df_raw = pd.read_csv("as_set_sizes2.csv.gz")

In [80]: df = df_raw[~df_raw['as_set'].str.contains('#')]

In [81]: df
Out[81]:
                        as_set  size  depth
0            AS-399760-CLIENTS     1      0
1                   AS-ISIONUK     2      0
3                  AS-NICPROXY     7      0
4           AS199221:AS-199221     2      0
7         AS28855:AS-CUSTOMERS     1      0
...                        ...   ...    ...
71213                 AS-BITON     1      0
71214                 AS-30998     8      0
71215    AS268235:AS-CUSTOMERS    42      0
71216                    AS-V6    16      0
71217  AS-SET-LEVEL-7-INTERNET     5      0

[53268 rows x 3 columns]

In [82]: df.describe()
Out[82]:
               size         depth
count  53268.000000  53268.000000
mean     648.176260      1.730194
std     5636.609586     13.263516
min        0.000000      0.000000
25%        1.000000      0.000000
50%        2.000000      0.000000
75%        6.000000      0.000000
max    95591.000000    299.000000

The distribution of sizes is almost unchanged, because the previous bug only affected the few nested sets.
As we can see now, most sets are flat.

Still not a Zipf.

image

Same script used as above.

@SichangHe
Owner Author

SichangHe commented Jan 10, 2024

In [119]: df.tail(260)
Out[119]:
                    as_set   size  depth
53008         AS3326:AS-UA  89580     40
53009   AS15562:AS-LEVEL42      1     41
53010   AS15562:AS-LEVEL43      1     42
53011   AS15562:AS-LEVEL44      1     43
53012   AS15562:AS-LEVEL45      1     44
...                    ...    ...    ...
53263  AS15562:AS-LEVEL296      1    295
53264  AS15562:AS-LEVEL297      1    296
53265  AS15562:AS-LEVEL298      1    297
53266  AS15562:AS-LEVEL299      1    298
53267  AS15562:AS-LEVEL300      1    299

[260 rows x 3 columns]

It seems that AS15562 just has a chain of nested AS Sets. They are an anomaly, though.

@SichangHe
Owner Author

The Zipf distribution fits after removing sets with no members: SichangHe/internet_route_verification_meta@cf00b74

In [122]: res = fit(zipf, df[df["size"] > 0]["size"], [(1.0, 10.0)])
     ...: print(res)
  params: FitParams(a=1.498530833262867, loc=0.0)
 success: True
 message: 'Optimization terminated successfully.'

image
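
Why removing empty sets matters: the Zipf distribution (with `loc=0`) only has support on 1, 2, 3, ..., so a single observation of 0 has probability zero and makes the log-likelihood infinite. Below is a library-free sketch of that effect. The `zeta` approximation and the `zipf_nll` helper are written for illustration and are not what `scipy.stats.fit` runs internally.

```python
import math


def zeta(a, terms=10_000):
    """Approximate the Riemann zeta function for a > 1.

    Partial sum plus an Euler-Maclaurin estimate of the remaining tail.
    """
    partial = sum(k ** -a for k in range(1, terms + 1))
    return partial + terms ** (1 - a) / (a - 1) - 0.5 * terms ** -a


def zipf_nll(a, data):
    """Negative log-likelihood of `data` under Zipf(a)."""
    if any(k < 1 for k in data):
        return math.inf  # the pmf is zero outside the support {1, 2, ...}
    return a * sum(math.log(k) for k in data) + len(data) * math.log(zeta(a))


sizes = [0, 1, 1, 2, 5, 1, 3]  # toy data containing one empty set
nll_with_zero = zipf_nll(1.5, sizes)
nll_without_zero = zipf_nll(1.5, [s for s in sizes if s > 0])
print(nll_with_zero, nll_without_zero)
```

An infinite objective at every candidate `a` is consistent with the earlier "inconsistent with the data" message, though the thread does not confirm that this is the exact code path.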

@SichangHe
Owner Author

Depths also fit: SichangHe/internet_route_verification_meta@f3a25d0.

In [126]:     df = df_wo_hash[df_wo_hash["depth"] > 0]
     ...:     res = fit(zipf, df["depth"], [(1.0, 10.0)])
     ...:     print(res)
  params: FitParams(a=1.8108806886591249, loc=0.0)
 success: True
 message: 'Optimization terminated successfully.'

image

@SichangHe SichangHe added noted Noted this in the writing lateststats Latest and up-to-date statistics and removed Stats Statistics record labels Jan 10, 2024
@SichangHe
Owner Author

@SichangHe SichangHe reopened this Jan 22, 2024
@SichangHe SichangHe changed the title Flatten AS Sets & getting sizes Flatten AS Sets & getting sizes and other stats Jan 22, 2024
@SichangHe
Owner Author

as_set_graph_stats.csv

@SichangHe
Owner Author

SichangHe commented Jan 22, 2024

Results using the stats from the AS Set graph. (Edit: updated commit: SichangHe/internet_route_verification_meta@437b5fa)

$ python3 -m scripts.stats.as_set_size_fitting
Overview:
         n_sets    n_nums     depth
count  53268.00  53268.00  53268.00
mean      98.11    646.85      2.37
std      946.29   5628.59     13.10
min        0.00      0.00      0.00
25%        0.00      1.00      1.00
50%        0.00      2.00      1.00
75%        1.00      6.00      1.00
max    20426.00  95572.00    300.00

AS Set sizes in AS Num counts.
7746 (14.54%) AS Sets have no AS Num.
Fitting Zipf distribution: Negative log-likelihood 146934.38357877164.
  params: FitParams(a=1.4983353782683766, loc=0.0)
 success: True
 message: 'Optimization terminated successfully.'

AS Set nesting depths.
Fitting Zipf distribution: Negative log-likelihood inf.
  params: FitParams(a=2.4292771966768467, loc=0.0)
 success: False
 message: 'Optimization converged to parameter values that are inconsistent with the data.'

AS Set with cycles.
3112 (5.84%) AS Sets have cycles.
3050 (22.42%) have cycles, 3129 (23.00%) have depth 5 or more, among 13602 AS Sets containing other AS Sets.

The first negative log-likelihood seems too big. The depth fitting somehow failed after we applied the more accurate stats from the AS Set graphs.
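
On the size of the negative log-likelihood: it is a sum over all fitted observations, so it grows with the sample size. A quick back-of-the-envelope check, assuming (this is a guess, the script is not shown here) that the 7746 sets with no AS Num were excluded before fitting:

```python
total_sets = 53268
empty_sets = 7746  # "AS Sets have no AS Num" in the output above
nll = 146934.38357877164

# Assumption: zero-size sets were dropped, since Zipf has no mass at 0.
fitted = total_sets - empty_sets
nll_per_set = nll / fitted
print(fitted, round(nll_per_set, 4))
```

That works out to roughly 3.23 nats per set, which does not look unreasonable for a heavy-tailed distribution. Separately, the `inf` reported for depths would be consistent with zero-depth rows still being present (the overview shows a minimum depth of 0), since the Zipf support starts at 1. This is a guess, not something the thread confirms.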

@SichangHe
Owner Author

62 AS Sets contain themselves and no other AS Sets.
In [7]: has_cycle_no_set = df[df['has_cycle'] & (df['n_sets'] == 0)]

In [8]: has_cycle_no_set
Out[8]:
                                  as_set  n_sets  n_nums  depth  has_cycle
335                      AS399899:AS-ALL       0       0      0       True
877             AS266594:AS-DATACENTRICS       0       0      0       True
1300                           AS-400352       0       0      0       True
2313   AS272677:AS-FIBRATECHTELECOM-ONLY       0       0      0       True
4134                           AS-395359       0       0      0       True
...                                  ...     ...     ...    ...        ...
63388                    AS-STELECOM-ALL       0       8      1       True
64771           AS270436:AS-SOOU-TELECOM       0       1      1       True
69368                           AS-HERTZ       0       1      1       True
70309                             AS-PDL       0       1      1       True
70678                       as-ouiheberg       0       4      1       True

[62 rows x 5 columns]

In [9]: print(has_cycle_no_set.to_string())
                                  as_set  n_sets  n_nums  depth  has_cycle
335                      AS399899:AS-ALL       0       0      0       True
877             AS266594:AS-DATACENTRICS       0       0      0       True
1300                           AS-400352       0       0      0       True
2313   AS272677:AS-FIBRATECHTELECOM-ONLY       0       0      0       True
4134                           AS-395359       0       0      0       True
4467                      AS268525:AS-E3       0       2      1       True
5146                           AS-FUTURE       0       0      0       True
5951                       AS-ETELECOM-2       0       0      0       True
9676                   AS29831:AS-LEGACY       0       0      0       True
10654                           AS-11700       0       0      0       True
11515                   AS17917:AS-55474       0       0      0       True
12037       AS270912:AS-GMN-TELECOM-ONLY       0       0      0       True
12780                           AS-42386       0       0      0       True
14320             AS136093:AS-SET-FAZNET       0       0      0       True
14760              AS269782:AS-CUSTOMERS       0       4      1       True
14870                           AS-38215       0       0      0       True
16693            AS264301:AS-LNP-TELECOM       0       1      1       True
16809                           AS-LEKKS       0       0      0       True
16968                      AS-ALBIDEYNET       0       0      0       True
17071                       AS-63603-GSS       0       2      1       True
17102                           as-44702       0       0      0       True
17869         AS268215:AS-REDESPEED-ONLY       0       0      0       True
20497                          AS-397178       0       0      0       True
21150                     AS-13720-SOE-2       0       0      0       True
26822                        AS-BURSABIL       0       2      1       True
27301                   AS400457:AS-HWLC       0       0      0       True
27943     AS268430:AS-FLASHLINK-INTERNET       0       1      1       True
28304                      AS-SDV-BERLIN       0       1      1       True
32296                           AS-33350       0       0      0       True
32773                           AS-NATCO       0       0      0       True
33330                     AS26388:AS-ALL       0       1      1       True
33436                     AS46816:AS-ALL       0       1      1       True
33457                           AS-19256       0       4      1       True
33627             AS-ITHOLDINGSCUSTOMERS       0      33      1       True
36517                  AS266671:AS-ALLv6       0       1      1       True
38950                           as-hsams       0       1      1       True
39662                     AS-IPv6-edu-pl       0       1      1       True
40957        AS265283:AS-IAGONET-TELECOM       0       1      1       True
41057                    AS19879:AS-CORE       0       1      1       True
44263       AS269782:AS-NETWORKSPEED-001       0       1      1       True
44328                          AS-RLINE1       0       0      0       True
44731                          AS-397446       0       1      1       True
45079                          AS-394036       0       0      0       True
45407                          as-210750       0       0      0       True
46907                           AS-58650       0       0      0       True
47889                  AS30002:AS-NULOOP       0       1      1       True
48593                         AS-AL-2098       0       1      1       True
49858                           AS-BDNET       0       4      1       True
51126                    AS-OPERATELECOM       0       0      0       True
51959                          AS-398052       0       0      0       True
55095                         AS-AS32489       0       0      0       True
56880                         AS-RIKSNET       0       5      1       True
57650            AS400088:AS-WAWA-PUBLIC       0       0      0       True
57842                           AS-21783       0       1      1       True
58890                        AS-NETNITCO       0       1      1       True
59504                           AS-36113       0       0      0       True
61486                 AS204141:AS-KNETDK       0       1      1       True
63388                    AS-STELECOM-ALL       0       8      1       True
64771           AS270436:AS-SOOU-TELECOM       0       1      1       True
69368                           AS-HERTZ       0       1      1       True
70309                             AS-PDL       0       1      1       True
70678                       as-ouiheberg       0       4      1       True
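
To make the "contains itself" case concrete, here is a small sketch of one way such sets could be detected. It assumes a hypothetical `nested` mapping from each AS Set name to the set names it lists as members. The actual `has_cycle` and `n_sets` columns come from the graph stats and may be defined differently.

```python
def reaches_itself(name, nested):
    """True if following nested-set references from `name` leads back to `name`."""
    seen = set()
    stack = list(nested.get(name, ()))
    while stack:
        current = stack.pop()
        if current == name:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(nested.get(current, ()))
    return False


# Hypothetical toy graph, not real IRR data.
NESTED = {
    "AS-SELF": {"AS-SELF"},   # lists only itself
    "AS-X": {"AS-Y"},         # two-set cycle
    "AS-Y": {"AS-X"},
    "AS-LEAF": set(),
    "AS-PARENT": {"AS-LEAF"},
}

cyclic = sorted(n for n in NESTED if reaches_itself(n, NESTED))
# Sets whose only set member is themselves.
self_only = [n for n, kids in NESTED.items()
             if reaches_itself(n, NESTED) and kids <= {n}]
print(cyclic, self_only)
```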

@cunha
Collaborator

cunha commented Jan 25, 2024 via email

@SichangHe
Owner Author

For the goodness of fit of the Zipf distribution, it seems that the chi-squared test may be our best bet. It is quite a pain to find goodness-of-fit tests for discrete distributions. Related.
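
A rough sketch of what such a chi-squared check could look like, without any library: bin the observations as 1, 2, ..., `max_value` plus one tail bin, and compare observed counts with the counts expected under Zipf(a). The helper names, the bin layout, and the toy data are assumptions for illustration. Turning the statistic into a p-value would additionally need the chi-squared survival function (for example `scipy.stats.chi2.sf`).

```python
import math
from collections import Counter


def zeta(a, terms=10_000):
    """Approximate the Riemann zeta function for a > 1 (partial sum plus tail estimate)."""
    partial = sum(k ** -a for k in range(1, terms + 1))
    return partial + terms ** (1 - a) / (a - 1) - 0.5 * terms ** -a


def chi_squared_zipf(data, a, max_value=5):
    """Pearson chi-squared statistic for `data` (all >= 1) against Zipf(a).

    Values 1..max_value get one bin each; everything larger shares a tail bin.
    Returns (statistic, degrees_of_freedom).
    """
    n = len(data)
    z = zeta(a)
    counts = Counter(min(k, max_value + 1) for k in data)
    probs = [k ** -a / z for k in range(1, max_value + 1)]
    probs.append(1.0 - sum(probs))  # probability mass of the tail bin
    stat = 0.0
    for i, p in enumerate(probs, start=1):
        expected = n * p
        stat += (counts.get(i, 0) - expected) ** 2 / expected
    # bins - 1, minus one more because `a` was estimated from the data
    dof = len(probs) - 2
    return stat, dof


stat, dof = chi_squared_zipf([1, 1, 1, 2, 2, 3, 7, 1, 1, 4], a=1.5)
print(stat, dof)
```

The usual rule of thumb is to choose bins so that each expected count is at least about 5, so `max_value` would need tuning for the real data.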

@cunha
Collaborator

cunha commented Jan 25, 2024 via email

@SichangHe SichangHe added the deferred Maybe work on this later label Jan 26, 2024
@SichangHe
Owner Author

Closing because the mention of the Zipf distribution has been removed from the text.

@SichangHe
Owner Author

Overview:
         n_sets    n_nums     depth
count  53268.00  53268.00  53268.00
mean      98.11    646.85      2.37
std      946.29   5628.59     13.10
min        0.00      0.00      0.00
25%        0.00      1.00      1.00
50%        0.00      2.00      1.00
75%        1.00      6.00      1.00
max    20426.00  95572.00    300.00

AS Set sizes in AS Num counts.
7746 (14.54%) AS Sets have no AS Num.
Fitting Zipf distribution: Negative log-likelihood 146934.38357877184.
  params: FitParams(a=1.4983354144185919, loc=0.0)
 success: True
 message: 'Optimization terminated successfully.'
17430 (32.7%) AS Sets contain only one AS Num.
772 (1.4%) AS Sets contain more than 10,000 AS Nums.

AS Set nesting depths.
Fitting Zipf distribution: Negative log-likelihood inf.
  params: FitParams(a=2.4291181853610495, loc=0.0)
 success: False
 message: 'Optimization converged to parameter values that are inconsistent with the data.'

AS Set with cycles.
3112 (5.84%) AS Sets have cycles.
3050 (22.42%) have cycles, 3129 (23.00%) have depth 5 or more, among 13602 AS Sets containing other AS Sets.
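
As a sanity check, the percentages above can be reproduced from the raw counts. Note the two different denominators: all 53268 AS Sets for most lines, and the 13602 AS Sets containing other AS Sets for the last line. The `pct` helper is just for illustration.

```python
total = 53268        # all AS Sets (excluding pseudo sets)
with_nested = 13602  # AS Sets containing other AS Sets


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


shares = {
    "no AS Num": pct(7746, total),
    "only one AS Num": pct(17430, total),
    "more than 10,000 AS Nums": pct(772, total),
    "have cycles": pct(3112, total),
    "nested, have cycles": pct(3050, with_nested),
    "nested, depth 5 or more": pct(3129, with_nested),
}
for label, value in shares.items():
    print(f"{label}: {value}%")
```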
