Skip to content

CSFelix/prime-numbers-statistical-analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

21 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ”’ Prime Numbers Statistical Analysis


Let's dive into the Prime Numbers' World


πŸ‘‹ Hello guys, today we will go into an advanced math topic: Prime Numbers. Our task here is to understand what they are, their importance and usages in real-life; understand the Riemann Hypothesis; make a Statistic Analysis of a dataset containing the first one million prime numbers; and play with simpy to classify numbers into primes and non-primes.


πŸ“ Problem Description

- Overview

Prime Numbers are especific natural numbers that cannot be divised by other natural ones, but itself and 1. Natural numbers are those ones that are positives and integers, that is, fall into the range from 1 to positive infinite. And you may be wondering: "Yeah, yeah, I learned about them on school. So what? What do they have that make them so special?". You'll find it out right now!

You probably have ever heard from someone or read on the internet that you can do anything with your computer, especially if you know how to program. I am afraid, but I have bad news: That's "bullshit"! To a problem be considered possible to be solved by computers, it has to follow two conditions: 1) a pattern to solve the problem must exist and be possible to be implemented with a computer; and 2) the implementation should solve it in a considerable period of time using a considerable process cost.

In Computer Science there are the Nondeterministic Polynomial Problems (NPs), they are problems that does not follow one or both of the previous conditions, and guess what? Identifying whether a number is prime or not is one of them - just for curiosity, all NP problems are listed here: List of Unsolved Problems in Mathematics.

Guessing whether 5, 8, 80 and 987 are primes or not is a easy task for a computer, but when the number of digits increases, it becomes a tough task, demanding more time and process cost to solve it. For instance, consider the number 983,129,319,128,349,245,423,213,643,369,126,335,654,234,126,324,363, quite a big one, right? It would take years to be identified as prime or non-prime by a computer.


An image containing random numbers

Fig. 1 - Numbers. UnacademyΒ©

- Cyber Security

Having this in mind, Security Engineers chosen the Prime Numbers to be the base of Cyber Security Cyphers, such as Hash Tables and Rivest–Shamir–Adleman (RSA) Algorithm. In RSA, two large prime numbers are multiplied to result in a huge number to generate the public and private keys that will be used to encrypt and decrypt messages respectively. If a computer be able to decompose this huge number, the machine would be able to get the prime numbers that generated it and then, calculate the private key and decript any message encrypted by its public key.

If this becomes possible, the cyber-security in general would be in chaos. Imagine hackers getting access to your private messages and bank accounts. Now think bigger, picture hackers getting access to government databases and stealing and encrypting critical datas. The world would be in collapse!!


- Riemann's Hypothesis

Riemann's Hypothesis is a Hypothesis Equation that tells if the the number is prime or not. To our better understanding, consider the spiral below:


Archimedean Spiral

Fig. 2 - Archimedean Spiral. Michael M. Ross at naturalnumbers.orgΒ©

Now, let's fill the spiral line with numbers from 0 to infinite positive starting at the center.


Archimedean Spiral with numbers

Fig. 3 - Archimedean Spiral with numbers. Michael M. Ross at naturalnumbers.orgΒ©

Identifying the counterclockwise, that is, the positions where a perfect square is formed, we'll realize that these positions are always in the east side - 1, 4, 9...:


Archimedean Spiral counterclockwises

Fig. 4 - Archimedean Spiral counterclockwises. Michael M. Ross at naturalnumbers.orgΒ©

And, after calculating the Spiral's Product Curves, we'll notice that all results corresponds to Prime Numbers Positions (Darkest Dots):


Archimedean Spiral's Product Curves

Fig. 5 - Archimedean Spiral's Product Curves. Michael M. Ross at naturalnumbers.orgΒ©

As more products we calculate, as more prime numbers we find; and tracing a sequential line on each dot position, we'll get another perfect spiral!


Archimedean Spiral's Product Curves Lined

Fig. 6 - Archimedean Spiral's Product Curves Lined. Michael M. Ross at naturalnumbers.orgΒ©

Taking this into account, Riemann Hypothesis gives an equation to calculate if the dot position of a given number fits a prime number dot in an Archimedean Spiral.

zeta(s) = sum(1/n**s) # being n from 1 to positive infinite

$$ \zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s} $$

Even though this equation is the closest one to call as a pattern to identify prime numbers and working out for the first thousand ones, we cannot say that it is 100% correct since we cannot test it out to all existent numbers!

OBS.: this was a short explanation about Riemann Hypothesis, if you wanna see it deeper, I would recommend watching the The Riemann Hypothesis, Explained YouTube video from Quanta Magazine channel.


πŸ“ Files Description

first_million_prime_numbers.csv and first_million_prime_numbers.parquet - files containing the first million prime numbers.

natural_numbers.csv and natural_numbers.parquet - files containing the numbers from one to the first million prime number.


The .parquet files are more suited for large datasets. In order to don't take a big section explaining the differences between CSV and Parquet files, here is a cheat sheet for you:


Comparison between CSV, Parquet and JSON Files

Fig. 7 - Comparison between CSV, Parquet and JSON Files. Even though Parquet files are harder to be read, they are more suitable for larger and complexes datasets. Also, you can read this type of file using any Apache tools, such as `Spark`, `Hadoop` and `Hive` or simply using the `read_parquet` function from Pandas library. Weber Stephen at MediumΒ©

❓ Features

Rank - prime number index;

Num | Number - number;

Interval - interval between the current and the previous prime number.


🌟 Target

Is Prime - tells whether a number is prime (True) or not (False).

Is Prime Meaning
True Number is prime
False Number is not prime

πŸ›‘ Limitations

As far as the dataset just contains the first million prime numbers, we cannot guarantee that this Statistical Analysis will be true for all existent ones that are not present in the dataset.

Riemann Hypothesis: Keep in mind that the limitation we got here is the same one present in Riemann Hypothesis.

🎯 Goals

Goal 1 - general statisc analysis of the dataset;

Goal 2 - check out for correlation between Num and Interval features;

Goal 3 - apply statistic tests for Hypothesis Confirmations;

Goal 4 - create natural_numbers.csv and natural_numbers.parquet files to store all numbers from 1 to the first million prime;

Goal 5 - explore the isPrime function from sympy library;

Goal 6 - explore the distribution of prime numbers over non-prime numbers.


βš™οΈ Setup

Tools

- Python 3.10.x version or later;
- Jupyter Lab;
- Jupyter Notebook.

Libraries

- Pandas, Numpy, Matplotlib, Seaborn, Sympy and Scipy;
- Pandas Profiling and Lux API;
- Apache Spark and Java JDK version 8 or later (if you wanna use Apache Spark rather Pandas);
- Pyarrow (to generate .parquet files).

πŸŽ‰ Acknowledgements

first_million_prime_numbers dataset provided by Michael M. Ross in naturalnumbers.org.

The Riemann Hypothesis, Explained YouTube video provided by Quanta Magazine channel.


Hope you enjoyed this notebook and feel free to check out more of my projects!!


πŸ“« Reach Me

Email - csfelix08@gmail.com

Linkedin - linkedin.com/in/csfelix/

GitHub: - CSFelix

Kaggle - DSFelix

Portfolio - CSFelix.io.