Flatten AS Sets & getting sizes and other stats #114
Comments
Overview of AS Set sizes (including pseudo sets).
In [2]: df = pd.read_csv('as_set_sizes.csv.gz')
In [3]: df
Out[3]:
as_set size
0 AS-399760-CLIENTS 1
1 AS-ISIONUK 2
2 c#203447 2
3 AS-NICPROXY 7
4 AS199221:AS-199221 2
... ... ...
71341 AS-BITON 1
71342 AS-30998 8
71343 AS268235:AS-CUSTOMERS 42
71344 AS-V6 16
71345 AS-SET-LEVEL-7-INTERNET 5
[71346 rows x 2 columns]
In [4]: df.describe()
Out[4]:
size
count 71346.000000
mean 439.580089
std 4461.051650
min 0.000000
25% 1.000000
50% 1.000000
75% 5.000000
max 93709.000000 |
Stats for pseudo sets and real sets:
In [5]: df_w_hash = df[df['as_set'].str.contains('#')]
In [6]: df_w_hash
Out[6]:
as_set size
2 c#203447 2
5 c#22987 5
6 c#395466 1
8 c#61605 2
9 m#AS-CBL-TRANSIT#MNT-BAHA 1
... ... ...
71309 m#AS-IXBR-TRANSIT4-SP#MAINT-AS28186 1
71318 c#59239 58
71326 c#212271 2
71337 c#7795 76
71339 c#46393 1
[17970 rows x 2 columns]
In [7]: df_wo_hash = df[~df['as_set'].str.contains('#')]
In [8]: df_wo_hash
Out[8]:
as_set size
0 AS-399760-CLIENTS 1
1 AS-ISIONUK 2
3 AS-NICPROXY 7
4 AS199221:AS-199221 2
7 AS28855:AS-CUSTOMERS 1
... ... ...
71341 AS-BITON 1
71342 AS-30998 8
71343 AS268235:AS-CUSTOMERS 42
71344 AS-V6 16
71345 AS-SET-LEVEL-7-INTERNET 5
[53376 rows x 2 columns]
In [9]: df_wo_hash.describe()
Out[9]:
size
count 53376.000000
mean 584.558847
std 5149.282015
min 0.000000
25% 1.000000
50% 2.000000
75% 6.000000
max 93709.000000 |
Sketch histogram. Code used (generated).
import matplotlib.pyplot as plt
# Plotting histogram
plt.hist(df_wo_hash['size'], bins=1000, edgecolor='black')
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Size')
plt.ylabel('Frequency')
plt.title('Histogram of Sizes in df_wo_hash')
plt.grid(True)
plt.show() |
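A note on the histogram above: with 1000 equal-width bins up to the maximum size of 93709, each bin is roughly 94 wide, so on a log x-axis nearly every set (75% have size 6 or less) lands in the first bin. A minimal sketch of log-spaced binning, using only the standard library; `log_binned_counts` and `bins_per_decade` are hypothetical names, not part of the project's scripts, and the sizes are toy values:

```python
import math


def log_binned_counts(values, bins_per_decade=10):
    """Count positive values in bins whose edges are evenly spaced in log10.

    Returns a dict mapping each non-empty bin's lower edge to its count.
    """
    counts = {}
    for v in values:
        if v <= 0:
            continue  # zeros cannot be placed on a log axis
        idx = math.floor(math.log10(v) * bins_per_decade)
        counts[idx] = counts.get(idx, 0) + 1
    # Convert each bin index back to the lower edge of that bin.
    return {10 ** (i / bins_per_decade): c for i, c in sorted(counts.items())}


toy_sizes = [1, 2, 5, 42, 0, 93709]  # toy values, not the real dataset
result = log_binned_counts(toy_sizes, bins_per_decade=1)
print(result)
```

With matplotlib, the same idea is to pass log-spaced edges (for example from `np.logspace`) as `bins` to `plt.hist` instead of an integer.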
AS Sets (86MiB): as_sets.txt.gz |
It seems that the Zipf distribution fitting failed!? IPython history.
In [1]: import pandas as pd
pdf
In [2]: df = pd.read_csv('as_set_sizes.csv.gz')
In [3]: df
Out[3]:
as_set size
0 AS-399760-CLIENTS 1
1 AS-ISIONUK 2
2 c#203447 2
3 AS-NICPROXY 7
4 AS199221:AS-199221 2
... ... ...
71341 AS-BITON 1
71342 AS-30998 8
71343 AS268235:AS-CUSTOMERS 42
71344 AS-V6 16
71345 AS-SET-LEVEL-7-INTERNET 5
[71346 rows x 2 columns]
In [5]: from scipy.stats import zipf, fit
In [8]: res = fit(zipf, df["size"], [(1.0, 10.0)])
In [9]: res
Out[9]:
params: FitParams(a=1.5713559159360042, loc=0.0)
success: False
message: 'Optimization converged to parameter values that are inconsistent with the data.' |
I've never seen anything like that before. I'd check whether the parameters passed to the function are correct... if that's not it then it may be that the data doesn't fit a Zipf distribution in the end.
|
Not a Zipf. |
IPython history: #121 |
Previous results were wrong due to a bug in the script (fixed in e29d04d). Results from the corrected script, with depth: as_set_sizes2.csv.gz
In [79]: df_raw = pd.read_csv("as_set_sizes2.csv.gz")
In [80]: df = df_raw[~df_raw['as_set'].str.contains('#')]
In [81]: df
Out[81]:
as_set size depth
0 AS-399760-CLIENTS 1 0
1 AS-ISIONUK 2 0
3 AS-NICPROXY 7 0
4 AS199221:AS-199221 2 0
7 AS28855:AS-CUSTOMERS 1 0
... ... ... ...
71213 AS-BITON 1 0
71214 AS-30998 8 0
71215 AS268235:AS-CUSTOMERS 42 0
71216 AS-V6 16 0
71217 AS-SET-LEVEL-7-INTERNET 5 0
[53268 rows x 3 columns]
In [82]: df.describe()
Out[82]:
size depth
count 53268.000000 53268.000000
mean 648.176260 1.730194
std 5636.609586 13.263516
min 0.000000 0.000000
25% 1.000000 0.000000
50% 2.000000 0.000000
75% 6.000000 0.000000
max 95591.000000 299.000000
The distribution of sizes is almost unchanged, because the previous bug only affected the few nested sets. |
In [119]: df.tail(260)
Out[119]:
as_set size depth
53008 AS3326:AS-UA 89580 40
53009 AS15562:AS-LEVEL42 1 41
53010 AS15562:AS-LEVEL43 1 42
53011 AS15562:AS-LEVEL44 1 43
53012 AS15562:AS-LEVEL45 1 44
... ... ... ...
53263 AS15562:AS-LEVEL296 1 295
53264 AS15562:AS-LEVEL297 1 296
53265 AS15562:AS-LEVEL298 1 297
53266 AS15562:AS-LEVEL299 1 298
53267 AS15562:AS-LEVEL300 1 299
[260 rows x 3 columns]
It seems that AS15562 just has a long chain of AS Sets, each nested in the next. It is rather an anomaly, though. |
The Zipf distribution fits after removing sets with no members: SichangHe/internet_route_verification_meta@cf00b74
In [122]: res = fit(zipf, df[df["size"] > 0]["size"], [(1.0, 10.0)])
...: print(res)
params: FitParams(a=1.498530833262867, loc=0.0)
success: True
message: 'Optimization terminated successfully.' |
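The likely reason removing empty sets helps: the Zipf (zeta) distribution is only defined on integers k >= 1, so a single size of 0 has probability zero and makes the negative log-likelihood infinite. A minimal standard-library sketch of that likelihood; `zeta` and `zipf_nll` are hypothetical helpers written for illustration (not SciPy's implementation), and the sizes are toy values:

```python
import math


def zeta(a, n_terms=1000):
    """Approximate the Riemann zeta function for a > 1.

    Sums the first terms directly and adds an Euler-Maclaurin tail correction.
    """
    n = n_terms
    head = sum(k ** -a for k in range(1, n))
    tail = n ** (1 - a) / (a - 1) + 0.5 * n ** -a + a * n ** (-a - 1) / 12
    return head + tail


def zipf_nll(sizes, a):
    """Negative log-likelihood of `sizes` under pmf(k) = k**-a / zeta(a), k >= 1."""
    if any(k < 1 for k in sizes):
        return math.inf  # values outside the support have zero probability
    return a * sum(math.log(k) for k in sizes) + len(sizes) * math.log(zeta(a))


toy_sizes = [1, 1, 1, 2, 2, 5, 42]  # toy values, not the real dataset
print(zipf_nll(toy_sizes, 1.5))        # finite
print(zipf_nll([0] + toy_sizes, 1.5))  # infinite because of the single zero
```

The same reasoning may apply to any other column that still contains zeros when it is passed to the fit.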
Depths also fit: SichangHe/internet_route_verification_meta@f3a25d0.
In [126]: df = df_wo_hash[df_wo_hash["depth"] > 0]
...: res = fit(zipf, df["depth"], [(1.0, 10.0)])
...: print(res)
params: FitParams(a=1.8108806886591249, loc=0.0)
success: True
message: 'Optimization terminated successfully.' |
Using the stats computed from the AS Set graphs. (Edit: updated commit: SichangHe/internet_route_verification_meta@437b5fa)
$ python3 -m scripts.stats.as_set_size_fitting
Overview:
n_sets n_nums depth
count 53268.00 53268.00 53268.00
mean 98.11 646.85 2.37
std 946.29 5628.59 13.10
min 0.00 0.00 0.00
25% 0.00 1.00 1.00
50% 0.00 2.00 1.00
75% 1.00 6.00 1.00
max 20426.00 95572.00 300.00
AS Set sizes in AS Num counts.
7746 (14.54%) AS Sets have no AS Num.
Fitting Zipf distribution: Negative log-likelihood 146934.38357877164.
params: FitParams(a=1.4983353782683766, loc=0.0)
success: True
message: 'Optimization terminated successfully.'
AS Set nesting depths.
Fitting Zipf distribution: Negative log-likelihood inf.
params: FitParams(a=2.4292771966768467, loc=0.0)
success: False
message: 'Optimization converged to parameter values that are inconsistent with the data.'
AS Set with cycles.
3112 (5.84%) AS Sets have cycles.
3050 (22.42%) have cycles, 3129 (23.00%) have depth 5 or more, among 13602 AS Sets containing other AS Sets.
The first negative log-likelihood seems too big. The depth fitting somehow failed after we applied the more accurate stats from the AS Set graphs. |
62 AS Sets contain themselves and no other AS Sets.
In [7]: has_cycle_no_set = df[df['has_cycle'] & (df['n_sets'] == 0)]
In [8]: has_cycle_no_set
Out[8]:
as_set n_sets n_nums depth has_cycle
335 AS399899:AS-ALL 0 0 0 True
877 AS266594:AS-DATACENTRICS 0 0 0 True
1300 AS-400352 0 0 0 True
2313 AS272677:AS-FIBRATECHTELECOM-ONLY 0 0 0 True
4134 AS-395359 0 0 0 True
... ... ... ... ... ...
63388 AS-STELECOM-ALL 0 8 1 True
64771 AS270436:AS-SOOU-TELECOM 0 1 1 True
69368 AS-HERTZ 0 1 1 True
70309 AS-PDL 0 1 1 True
70678 as-ouiheberg 0 4 1 True
[62 rows x 5 columns]
In [9]: print(has_cycle_no_set.to_string())
as_set n_sets n_nums depth has_cycle
335 AS399899:AS-ALL 0 0 0 True
877 AS266594:AS-DATACENTRICS 0 0 0 True
1300 AS-400352 0 0 0 True
2313 AS272677:AS-FIBRATECHTELECOM-ONLY 0 0 0 True
4134 AS-395359 0 0 0 True
4467 AS268525:AS-E3 0 2 1 True
5146 AS-FUTURE 0 0 0 True
5951 AS-ETELECOM-2 0 0 0 True
9676 AS29831:AS-LEGACY 0 0 0 True
10654 AS-11700 0 0 0 True
11515 AS17917:AS-55474 0 0 0 True
12037 AS270912:AS-GMN-TELECOM-ONLY 0 0 0 True
12780 AS-42386 0 0 0 True
14320 AS136093:AS-SET-FAZNET 0 0 0 True
14760 AS269782:AS-CUSTOMERS 0 4 1 True
14870 AS-38215 0 0 0 True
16693 AS264301:AS-LNP-TELECOM 0 1 1 True
16809 AS-LEKKS 0 0 0 True
16968 AS-ALBIDEYNET 0 0 0 True
17071 AS-63603-GSS 0 2 1 True
17102 as-44702 0 0 0 True
17869 AS268215:AS-REDESPEED-ONLY 0 0 0 True
20497 AS-397178 0 0 0 True
21150 AS-13720-SOE-2 0 0 0 True
26822 AS-BURSABIL 0 2 1 True
27301 AS400457:AS-HWLC 0 0 0 True
27943 AS268430:AS-FLASHLINK-INTERNET 0 1 1 True
28304 AS-SDV-BERLIN 0 1 1 True
32296 AS-33350 0 0 0 True
32773 AS-NATCO 0 0 0 True
33330 AS26388:AS-ALL 0 1 1 True
33436 AS46816:AS-ALL 0 1 1 True
33457 AS-19256 0 4 1 True
33627 AS-ITHOLDINGSCUSTOMERS 0 33 1 True
36517 AS266671:AS-ALLv6 0 1 1 True
38950 as-hsams 0 1 1 True
39662 AS-IPv6-edu-pl 0 1 1 True
40957 AS265283:AS-IAGONET-TELECOM 0 1 1 True
41057 AS19879:AS-CORE 0 1 1 True
44263 AS269782:AS-NETWORKSPEED-001 0 1 1 True
44328 AS-RLINE1 0 0 0 True
44731 AS-397446 0 1 1 True
45079 AS-394036 0 0 0 True
45407 as-210750 0 0 0 True
46907 AS-58650 0 0 0 True
47889 AS30002:AS-NULOOP 0 1 1 True
48593 AS-AL-2098 0 1 1 True
49858 AS-BDNET 0 4 1 True
51126 AS-OPERATELECOM 0 0 0 True
51959 AS-398052 0 0 0 True
55095 AS-AS32489 0 0 0 True
56880 AS-RIKSNET 0 5 1 True
57650 AS400088:AS-WAWA-PUBLIC 0 0 0 True
57842 AS-21783 0 1 1 True
58890 AS-NETNITCO 0 1 1 True
59504 AS-36113 0 0 0 True
61486 AS204141:AS-KNETDK 0 1 1 True
63388 AS-STELECOM-ALL 0 8 1 True
64771 AS270436:AS-SOOU-TELECOM 0 1 1 True
69368 AS-HERTZ 0 1 1 True
70309 AS-PDL 0 1 1 True
70678 as-ouiheberg 0 4 1 True |
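For reference, a minimal sketch of how flattening with a self-containment check could work. This is an illustration only, not the project's implementation: the `as_sets` mapping, the `flatten` function, and the example set names are hypothetical, and the AS numbers come from the range reserved for documentation. Here a set counts as containing itself when it is reachable from its own members, which is an assumption about what `has_cycle` means.

```python
def flatten(root, as_sets):
    """Return (as_nums, contains_itself) for the AS Set named `root`.

    `as_sets` maps a set name to a pair (member AS numbers, member set names).
    Names that are referenced but not defined are treated as empty sets.
    """
    as_nums = set()
    visited = set()
    contains_itself = False
    stack = [root]
    while stack:
        name = stack.pop()
        if name in visited:
            continue  # already expanded; this also stops infinite loops
        visited.add(name)
        nums, children = as_sets.get(name, ((), ()))
        as_nums.update(nums)
        for child in children:
            if child == root:
                contains_itself = True  # root is reachable from its own members
            stack.append(child)
    return as_nums, contains_itself


# Hypothetical example data.
as_sets = {
    "AS-SELF": ({64500}, {"AS-SELF"}),   # contains itself directly
    "AS-A": ({64501}, {"AS-B"}),         # AS-A and AS-B contain each other
    "AS-B": ({64502, 64503}, {"AS-A"}),
    "AS-C": ({64504}, {"AS-A"}),         # reaches a cycle but is not on it
}
for name in as_sets:
    print(name, flatten(name, as_sets))
```

The explicit stack avoids hitting the recursion limit on long chains such as the AS15562 one.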
Funny;
worth mentioning in the paper.
|
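For the goodness of fit of the Zipf distribution, it seems that the Chi-squared test (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html) may be our best bet. It is quite a pain to find tests for discrete distributions. Related: https://stats.stackexchange.com/a/129734 |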
"This test is invalid when the observed or expected frequencies in each
category are too small. A typical rule is that all of the observed and
expected frequencies should be at least 5."
Maybe it doesn't work for the AS-set sizes (and Zipf distributions in
general, as most AS-set sizes will have very few samples).
I looked around for a reasonable test for us to use, but didn't find
anything easy to apply. (I've applied KS-tests in the past, but on a Zipf
distro I think it will not pass even though the fit may be acceptable.) I
also found a reference that r-squared isn't a very good approach (like you
mentioned). Maybe let's put this on the back burner for now.
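One common workaround for the small-frequency problem, not something adopted in this thread, is to pool the sparse tail into a single bin before computing the chi-squared statistic. A minimal standard-library sketch, assuming all sizes are at least 1 and the exponent `a` was already fitted; `zeta` and `pooled_chi_square` are hypothetical helpers and the data is a toy sample:

```python
import math
from collections import Counter


def zeta(a, n_terms=1000):
    """Approximate the Riemann zeta function for a > 1 (Euler-Maclaurin tail)."""
    n = n_terms
    head = sum(k ** -a for k in range(1, n))
    return head + n ** (1 - a) / (a - 1) + 0.5 * n ** -a + a * n ** (-a - 1) / 12


def pooled_chi_square(sizes, a, min_expected=5.0):
    """Chi-squared statistic of `sizes` against Zipf(a), with the tail pooled.

    Returns (statistic, number of bins used).
    """
    n = len(sizes)
    observed = Counter(sizes)
    norm = zeta(a)
    stat, cum_prob, cum_obs, k = 0.0, 0.0, 0, 1
    # Zipf probabilities decrease with k, so keep one bin per value while
    # the expected count stays at or above the threshold.
    while n * k ** -a / norm >= min_expected:
        expected = n * k ** -a / norm
        stat += (observed[k] - expected) ** 2 / expected
        cum_prob += k ** -a / norm
        cum_obs += observed[k]
        k += 1
    # Pool every remaining value (>= k) into a single tail bin.
    tail_expected = n * (1.0 - cum_prob)
    stat += ((n - cum_obs) - tail_expected) ** 2 / tail_expected
    return stat, k  # (k - 1) single-value bins plus the tail bin


toy_sizes = [1] * 100  # toy sample that clearly does not follow Zipf(2)
stat, n_bins = pooled_chi_square(toy_sizes, a=2.0)
print(stat, n_bins)
```

If `a` was estimated from the same data, the usual degrees of freedom would be `n_bins - 2`; a p-value could then be obtained with `scipy.stats.chi2.sf(stat, n_bins - 2)`.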
|
Closing because the mention of the Zipf distribution has been removed from the text. |
Overview:
n_sets n_nums depth
count 53268.00 53268.00 53268.00
mean 98.11 646.85 2.37
std 946.29 5628.59 13.10
min 0.00 0.00 0.00
25% 0.00 1.00 1.00
50% 0.00 2.00 1.00
75% 1.00 6.00 1.00
max 20426.00 95572.00 300.00
AS Set sizes in AS Num counts.
7746 (14.54%) AS Sets have no AS Num.
Fitting Zipf distribution: Negative log-likelihood 146934.38357877184.
params: FitParams(a=1.4983354144185919, loc=0.0)
success: True
message: 'Optimization terminated successfully.'
17430 (32.7%) AS Sets contain only one AS Num.
772 (1.4%) AS Sets contain more than 10,000 AS Nums.
AS Set nesting depths.
Fitting Zipf distribution: Negative log-likelihood inf.
params: FitParams(a=2.4291181853610495, loc=0.0)
success: False
message: 'Optimization converged to parameter values that are inconsistent with the data.'
AS Set with cycles.
3112 (5.84%) AS Sets have cycles.
3050 (22.42%) have cycles, 3129 (23.00%) have depth 5 or more, among 13602 AS Sets containing other AS Sets. |
Commit: a6cfea2
as_set_sizes.csv.gz
TODO:
- [x] Fit Zipf distribution on depths.