# Web Scraping - Getting data for your machine learning models
- In today's world where data is key, that data can come from many different sources.
- It may come from a database, from a CSV file, document store, etc.

- In this notebook, let's instead target the web. To get data from the web, we typically have to **scrape** it. What is scraping? Let's walk through this notebook to learn more.

- We need data!
- Do you really want to copy and paste each web page? Absolutely not!
- This is where web scraping comes in:  
    - Automated extraction of data from websites and databases

- While we are doing this to get threat intelligence, it can also be used for:
    - competitor analysis, data mining, content aggregation, market research, etc.  
    - Whether we want photos, text, videos, etc., we can scrape the web for them

- With web scraping, we want to extract what we need and, more than likely, store it for later analysis

**Tools available**  
- Many tools are available:  
    - BeautifulSoup   
        - Extract data from HTML and XML files   
            - Generates a parse tree that we can traverse   
            - While a great tool, not recommended for complex website scraping   
    - Selenium  
        - Drives a real browser, so it can be slower than other scraping tools  
        - Very flexible, and hence a powerful web scraping tool
        - Limited support for non-web-based applications
    - Scrapy  
        - Fast, efficient and able to extract large amounts of data in short time
        - While great, not suitable for simple tasks
    - OctoParse  
        - Perfect for anyone wanting to extract data and save time learning code  
        - Uses a visual point and click interface  
        - No programming skills needed to get started 


Now that we have the tools, let's learn the techniques for scraping:  
    - Document Object Model (DOM) parsing:
        - Analyzing the HTML structure of the page to extract information from specific elements
        - The DOM is a tree-like structure that represents the web page
        - Requires a good understanding of HTML
        - This is where BeautifulSoup is very good
        - Each page is different, so you need to understand the structure of the page you are targeting  

    - Regular expressions:
        - Extract specific patterns from the web page
        - As the old joke goes, you now have two problems: the original problem plus the regex problem
    
    - XPath:  
        - Language that allows you to navigate HTML and XML documents and select specific elements or attributes
        - Can be used with lxml or Scrapy
        - Can handle complicated web pages
        - Expressions anchored on stable attributes may keep working after small page changes, but a major redesign can still break them
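
To make the techniques above concrete, here is a minimal sketch that pulls the same link out of a tiny page two ways: with a regular expression and with an XPath expression. The HTML snippet and the `example.com` URL are made up for illustration. The standard library's `xml.etree.ElementTree` is used only because it needs no install; it supports a limited XPath subset and requires well-formed markup, so for real pages a library such as lxml is likely the better fit.

```python
import re
import xml.etree.ElementTree as ET

# A tiny, well-formed page (hypothetical content, for illustration only)
page = """<html><body>
<div class="widget-content">
<a href="https://example.com/feed.txt">Feed</a>
<a name="anchor-only"></a>
</div>
</body></html>"""

# Technique 1: regular expression, matches the raw text and ignores structure
href_regex = re.compile(r'href="([^"]+)"')
regex_links = href_regex.findall(page)

# Technique 2: XPath, walks the tree and selects 'a' elements that have an href
root = ET.fromstring(page)
xpath_links = [
    a.get('href')
    for a in root.findall(".//div[@class='widget-content']/a[@href]")
]

print(regex_links)
print(xpath_links)
```

The regex knows nothing about where the link sits in the page, while the XPath expression only accepts links inside the `widget-content` div.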

- Whether scraping is legal is debatable.
- Whether it is ethical is also debatable.
- Think of The New York Times suing OpenAI
- You might want to go through the *Terms of Service* prior to scraping

- https://medium.com/geekculture/web-scraping-101-tools-techniques-and-best-practices-417e377fbeaf  

- Many websites do not welcome scraping and may block you  
- While these sites may allow search engines to crawl them so they show up in search results, they may still block you
- To avoid being blocked, you want to appear human-like and not like a bot
    - Use human tools  
    - Exhibit human-like behaviour

- You can also use a web scraping API to reduce the chance of being blocked, for example ScrapingBee
- Other ways to avoid getting blocked:
    - Use proxies  
    - Use a headless browser, for example one driven by Selenium 
        - Operates just like a browser but without an interface  
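
Alongside reading the Terms of Service, one common courtesy is to check a site's `robots.txt` before fetching a page. The notebook does not do this itself, so treat the following as an optional sketch: the rules text, the `example.com` URLs and the bot name `my-notebook-bot` are all made up, and the rules are parsed from a string so that nothing is fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at https://<site>/robots.txt
robots_txt = """User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether our (hypothetical) bot may fetch each URL
allowed = parser.can_fetch('my-notebook-bot', 'https://example.com/blog/post.html')
blocked = parser.can_fetch('my-notebook-bot', 'https://example.com/private/data.html')

# Number of seconds the site asks us to wait between requests, if stated
delay = parser.crawl_delay('my-notebook-bot')

print(allowed, blocked, delay)
```

Pausing for `delay` seconds between requests, for example with `time.sleep`, is one simple way to behave less like a bot.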

- Think about BeautifulSoup and Requests as the training wheels for scraping  
- You can open a socket directly to the web server  
- Can use **XPath**  
    - Can use XPath to read an HTML document  
    - XPath expressions  
    - Leverage an XPath engine (like lxml)


**Some additional references for different perspectives**
- https://www.scrapingbee.com/blog/web-scraping-without-getting-blocked/   
- https://www.scrapingbee.com/blog/web-scraping-101-with-python/   
- https://dev.to/aurken/mastering-web-scraping-101-in-depth-guide-44k2  
- https://www.upwork.com/resources/web-scraping-basics
- https://dev.to/gologinapp/web-scraping-with-python-a-complete-step-by-step-guide-code-53fh  
- https://rvest.tidyverse.org/articles/rvest.html

In [1]:
# As always we need a set of libraries to work with
# Let us import those
import re
import requests
import numpy as np

In [2]:
# Define the URL to target
url = 'https://www.securitynik.com'
url

'https://www.securitynik.com'

In [3]:
# Get the web page
session = requests.Session()
#url = 'https://www.securitynik.com'
server_response = session.get(url=url)
server_response

<Response [200]>

In [4]:
# Look at the response headers to find the response length
# There is no Content-Length header here, because the response uses chunked transfer encoding
server_response.headers

{'Content-Type': 'text/html; charset=UTF-8', 'Expires': 'Wed, 24 Jul 2024 22:32:39 GMT', 'Date': 'Wed, 24 Jul 2024 22:32:39 GMT', 'Cache-Control': 'private, max-age=0', 'Last-Modified': 'Wed, 10 Jul 2024 15:29:10 GMT', 'ETag': 'W/"bc7dff59dbb1ec4431b4fb7c177468a300a1605a8b2d17f50f2440f2caccd1dd"', 'Content-Encoding': 'gzip', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Server': 'GSE', 'Transfer-Encoding': 'chunked'}

In [5]:
# Find the length another way: count the characters in the decoded body
len(server_response.text)

1175993
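
Note that the length of `.text` is a count of decoded characters, while `.content` holds bytes, so the two can differ whenever the page contains non-ASCII characters. The sketch below shows the difference on a made-up string rather than on the real response. The headers above also show `Content-Encoding: gzip`, so the amount of data that travelled over the network was probably smaller than either number.

```python
# A made-up sample with one non-ASCII character
sample_text = 'café 10.0.0.1'

# Encoding to UTF-8 is roughly what a server does before sending the page
sample_bytes = sample_text.encode('utf-8')

char_count = len(sample_text)    # counts characters
byte_count = len(sample_bytes)   # counts bytes; 'é' needs two bytes in UTF-8

print(char_count, byte_count)
```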

In [6]:
# Print the page returned
# Print the first 1000 characters
print(server_response.text[:1000])

<!DOCTYPE html>
<html class='v2' dir='ltr' lang='en' xmlns='http://www.w3.org/1999/xhtml' xmlns:b='http://www.google.com/2005/gml/b' xmlns:data='http://www.google.com/2005/gml/data' xmlns:expr='http://www.google.com/2005/gml/expr'>
<head>
<link href='https://www.blogger.com/static/v1/widgets/3566091532-css_bundle_v2.css' rel='stylesheet' type='text/css'/>
<!-- ADDED BY NIK -->
<script async='async' src='//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js'></script>
<script>
  (adsbygoogle = window.adsbygoogle || []).push({
    google_ad_client: "ca-pub-9102486211528771",
    enable_page_level_ads: true
  });
</script>
<!-- END -->
<meta content='width=1100' name='viewport'/>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'/>
<meta content='blogger' name='generator'/>
<link href='https://www.securitynik.com/favicon.ico' rel='icon' type='image/x-icon'/>
<link href='https://www.securitynik.com/' rel='canonical'/>
<link rel="alternate" type="application/atom+xml" tit

In [7]:
# Time to import the BeautifulSoup library
from bs4 import BeautifulSoup

In [8]:
# Create the soup
soup = BeautifulSoup(server_response.text, 'html.parser')

In [9]:
# Let's find the first 'h1' tag
# Note that the text is still wrapped in the tag
soup.find('h1')

<h1 class="title">
Learning by practicing
</h1>

In [10]:
# Just get the text
soup.find('h1').text

'\nLearning by practicing\n'

In [11]:
# Let's find the first 'a' tag
soup.find('a')

<a name="2854979441102037864"></a>

In [12]:
# The calls above each returned a single entry. Is that all?
# Let's find all of them with find_all (findAll is the older, deprecated spelling)
soup.find_all('a')[:6]

[<a name="2854979441102037864"></a>,
 <a href="https://www.securitynik.com/2024/03/total-recall-2024-memory-forensics-self.html">**TOTAL RECALL 2024** - Memory Forensics Self-Paced Learning/Challenge/CTF</a>,
 <a href="https://www.securitynik.com/2023/09/solving-ctf-challenge-network-forensics.html" target="_blank"><i>Solving the CTF challenge - Network Forensics (packet and log analysis), USB Disk Forensics, Database Forensics, Stego</i></a>,
 <a href="https://msrc.microsoft.com/blog/2023/09/results-of-major-technical-investigations-for-storm-0558-key-acquisition/">Results of Major Technical Investigations for Storm-0558 Key Acquisition | MSRC Blog | Microsoft Security Response Center</a>,
 <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVHyGrslhh9zVntuHCw21xP7IsuqvhoJiK8zMaeAvYpvBOakeadMJWx8yugVK7Eg7vNMlILr4QNFILjxS7GyByqdSAtcsyvAQ72sRHYeVzPPUVP9IoXPqlCilGa_p1LvYt-Kp-5Q4kCjiLNuJkmy4HQQT1qo3e2MmNbNra91Hw3dHzJn_FSGp-Js2sDhw/s743/NIST-800-86.PNG" style="margin-left:

In [13]:
# Let's get the href
# This will not show anything, because get() returns None
# That's because the first 'a' tag has no href attribute
soup.find('a').get('href')

In [14]:
# However, if we get the name which is the first entry
soup.find('a').get('name')

'2854979441102037864'

In [15]:
# Looking at the widget content
# find() only returns the first match, which is not the widget we are after
soup.find(class_='widget-content')

<div class="widget-content">
<script async="" crossorigin="anonymous" src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-9102486211528771&amp;host=ca-host-pub-1556223355139109"></script>
<!-- securitynik_sidebar-right-1_AdSense1_1x1_as -->
<ins class="adsbygoogle" data-ad-client="ca-pub-9102486211528771" data-ad-format="auto" data-ad-host="ca-host-pub-1556223355139109" data-ad-slot="2741735213" data-full-width-responsive="true" style="display:block"></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script>
<div class="clear"></div>
</div>

In [16]:
# We can instead use find_all to get all the records
# that have this class name
# Below I am only looking at the first 2, in the interest of space
soup.find_all(class_='widget-content')[:2]

[<div class="widget-content">
 <script async="" crossorigin="anonymous" src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-9102486211528771&amp;host=ca-host-pub-1556223355139109"></script>
 <!-- securitynik_sidebar-right-1_AdSense1_1x1_as -->
 <ins class="adsbygoogle" data-ad-client="ca-pub-9102486211528771" data-ad-format="auto" data-ad-host="ca-host-pub-1556223355139109" data-ad-slot="2741735213" data-full-width-responsive="true" style="display:block"></ins>
 <script>
 (adsbygoogle = window.adsbygoogle || []).push({});
 </script>
 <div class="clear"></div>
 </div>,
 <div class="widget-content">
 <ul>
 <li><a class="profile-name-link g-profile" href="https://www.blogger.com/profile/06803145785321596661" style="background-image: url(//www.blogger.com/img/logo-16.png);">Abdul</a></li>
 <li><a class="profile-name-link g-profile" href="https://www.blogger.com/profile/10282323977269843041" style="background-image: url(//www.blogger.com/img/logo-16.png);">Nik 

In [17]:
# Collect every widget-content block so we can extract the URLs from one of them
soup_learning_links = soup.find_all(class_='widget-content')

# Show us the 7th widget-content class
soup_learning_links[6]

<div class="widget-content">
<b>[MALICIOUS IPs]</b><br/>
<a href="http://rules.emergingthreats.net/blockrules/compromised-ips.txt">Emerging Threats - Compromised IPs</a><br/>
<a href="http://myip.ms/files/blacklist/csf/latest_blacklist.txt">My IPs</a><br/>
<a href="http://www.spamhaus.org/drop/drop.txt">Spamhause drop </a><br/>
<a href="http://www.spamhaus.org/drop/edrop.txt">Spamhause edrop </a><br/>
<a href="http://rules.emergingthreats.net/fwrules/emerging-Block-IPs.txt">Emerging Threats Block IP</a><br/>
<a href="https://www.dshield.org/ipsascii.html">DShield</a><br/>
<a href="https://isc.sans.edu/block.txt">SANS ISC</a><br/>
<a href="https://zonefiles.io/f/compromised/ip/live/">Zonefiles.io</a><br/>
<a href="https://sslbl.abuse.ch/blacklist/sslipblacklist.txt
">SSL IP Blokclist</a><br/>
<a href="https://sslbl.abuse.ch/blacklist/sslipblacklist_aggressive.txt
">SSL IP Blokclist - Aggressive</a><br/>
<a href="https://feodotracker.abuse.ch/downloads/ipblocklist_recommended.txt
">Feedo

In [18]:
# Collect every widget-content block so we can extract the URLs from one of them
soup_learning_links = soup.find_all(class_='widget-content')

# Look for the class again. This time, look for href
# Once again, just looking at a subset, in the interest of space
soup_learning_links[6].find_all('a', href=True)[:10]

[<a href="http://rules.emergingthreats.net/blockrules/compromised-ips.txt">Emerging Threats - Compromised IPs</a>,
 <a href="http://myip.ms/files/blacklist/csf/latest_blacklist.txt">My IPs</a>,
 <a href="http://www.spamhaus.org/drop/drop.txt">Spamhause drop </a>,
 <a href="http://www.spamhaus.org/drop/edrop.txt">Spamhause edrop </a>,
 <a href="http://rules.emergingthreats.net/fwrules/emerging-Block-IPs.txt">Emerging Threats Block IP</a>,
 <a href="https://www.dshield.org/ipsascii.html">DShield</a>,
 <a href="https://isc.sans.edu/block.txt">SANS ISC</a>,
 <a href="https://zonefiles.io/f/compromised/ip/live/">Zonefiles.io</a>,
 <a href="https://sslbl.abuse.ch/blacklist/sslipblacklist.txt
 ">SSL IP Blokclist</a>,
 <a href="https://sslbl.abuse.ch/blacklist/sslipblacklist_aggressive.txt
 ">SSL IP Blokclist - Aggressive</a>]

In [19]:
# Extract the same info as above, this time via a for loop
# The loop variable is named link so it does not overwrite the url defined earlier
for link in soup_learning_links[6].find_all('a', href=True):
    print(link.get('href'))

http://rules.emergingthreats.net/blockrules/compromised-ips.txt
http://myip.ms/files/blacklist/csf/latest_blacklist.txt
http://www.spamhaus.org/drop/drop.txt
http://www.spamhaus.org/drop/edrop.txt
http://rules.emergingthreats.net/fwrules/emerging-Block-IPs.txt
https://www.dshield.org/ipsascii.html
https://isc.sans.edu/block.txt
https://zonefiles.io/f/compromised/ip/live/
https://sslbl.abuse.ch/blacklist/sslipblacklist.txt

https://sslbl.abuse.ch/blacklist/sslipblacklist_aggressive.txt

https://feodotracker.abuse.ch/downloads/ipblocklist_recommended.txt

https://feodotracker.abuse.ch/downloads/ipblocklist.txt
https://raw.githubusercontent.com/pallebone/StrictBlockPAllebone/master/BlockIP.txt
http://www.malwaredomainlist.com/hostslist/delisted.txt
https://openphish.com/feed.txt
https://raw.githubusercontent.com/notracking/hosts-blocklists/master/hostnames.txt
https://raw.githubusercontent.com/notracking/hosts-blocklists/master/hostnames.txt
https://zonefiles.io/f/compromised/domains/live

In [20]:
# If we wanted, we could store these in a list:
url_list = []
for link in soup_learning_links[6].find_all('a', href=True):
    url_list.append(link.get('href'))

# print the url_list
url_list

['http://rules.emergingthreats.net/blockrules/compromised-ips.txt',
 'http://myip.ms/files/blacklist/csf/latest_blacklist.txt',
 'http://www.spamhaus.org/drop/drop.txt',
 'http://www.spamhaus.org/drop/edrop.txt',
 'http://rules.emergingthreats.net/fwrules/emerging-Block-IPs.txt',
 'https://www.dshield.org/ipsascii.html',
 'https://isc.sans.edu/block.txt',
 'https://zonefiles.io/f/compromised/ip/live/',
 'https://sslbl.abuse.ch/blacklist/sslipblacklist.txt\n',
 'https://sslbl.abuse.ch/blacklist/sslipblacklist_aggressive.txt\n',
 'https://feodotracker.abuse.ch/downloads/ipblocklist_recommended.txt\n',
 'https://feodotracker.abuse.ch/downloads/ipblocklist.txt',
 'https://raw.githubusercontent.com/pallebone/StrictBlockPAllebone/master/BlockIP.txt',
 'http://www.malwaredomainlist.com/hostslist/delisted.txt',
 'https://openphish.com/feed.txt',
 'https://raw.githubusercontent.com/notracking/hosts-blocklists/master/hostnames.txt',
 'https://raw.githubusercontent.com/notracking/hosts-blocklists
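
The list above still carries some trailing `\n` characters from the page's markup, and at least one URL appears twice. A small clean-up pass, sketched below on a few of the entries, strips the whitespace and drops duplicates while keeping the original order. The names `raw_urls` and `clean_urls` are just for this illustration.

```python
# A few entries copied from the scraped list, including a newline and a duplicate
raw_urls = [
    'https://sslbl.abuse.ch/blacklist/sslipblacklist.txt\n',
    'https://openphish.com/feed.txt',
    'https://raw.githubusercontent.com/notracking/hosts-blocklists/master/hostnames.txt',
    'https://raw.githubusercontent.com/notracking/hosts-blocklists/master/hostnames.txt',
]

# Remove leading and trailing whitespace, including newlines
stripped_urls = [item.strip() for item in raw_urls]

# dict.fromkeys keeps the first occurrence of each URL and preserves order
clean_urls = list(dict.fromkeys(stripped_urls))

print(clean_urls)
```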

In [21]:
import pandas as pd

In [22]:
# Create a pandas DataFrame from these URLs,
# or store them in a database if you wish.
# We can now use this however we wish.
# We will speak later about how we deal with text for machine learning
url_df = pd.DataFrame(url_list, columns=['urls'])
url_df

Unnamed: 0,urls
0,http://rules.emergingthreats.net/blockrules/co...
1,http://myip.ms/files/blacklist/csf/latest_blac...
2,http://www.spamhaus.org/drop/drop.txt
3,http://www.spamhaus.org/drop/edrop.txt
4,http://rules.emergingthreats.net/fwrules/emerg...
5,https://www.dshield.org/ipsascii.html
6,https://isc.sans.edu/block.txt
7,https://zonefiles.io/f/compromised/ip/live/
8,https://sslbl.abuse.ch/blacklist/sslipblacklis...
9,https://sslbl.abuse.ch/blacklist/sslipblacklis...


In [23]:
# Maybe you wish to share this list to add the URLs to one of your threat intelligence tools
# Note you can save in many different formats
url_df.to_csv(path_or_buf=r'./urls.csv')

In [24]:
# Validate the file was created
# (dir is a Windows command; on Linux or macOS use: !ls urls.csv)
!dir urls.csv /b

urls.csv


In [25]:
url_df.values[0][0]

'http://rules.emergingthreats.net/blockrules/compromised-ips.txt'

In [26]:
# a sample IP looks like: 192.168.0.1

# Setup the IPv4 regex
ipv4_regex = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
ipv4_regex

re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', re.UNICODE)

In [27]:
# Test string
test_string = 'Welcome to securitynik.com 10.0.111.1 us get the IP addresses 1.2.3.4'
test_string

'Welcome to securitynik.com 10.0.111.1 us get the IP addresses 1.2.3.4'

In [28]:
# Demonstrate how we find the IP
ipv4_regex.findall(test_string)

['10.0.111.1', '1.2.3.4']
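
One caveat with this pattern is that `\d{1,3}` also accepts values above 255, so a string such as `999.1.1.1` would match. A possible second pass, sketched here with the standard library's `ipaddress` module, keeps only the candidates that parse as real IPv4 addresses. The helper name `is_valid_ipv4` and the sample string are made up for this example.

```python
import ipaddress
import re

ipv4_regex = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')

def is_valid_ipv4(candidate):
    """Return True if the text parses as an IPv4 address."""
    try:
        ipaddress.IPv4Address(candidate)
        return True
    except ipaddress.AddressValueError:
        return False

# Made-up sample with two valid addresses and one impossible one
sample = 'good 10.0.111.1 and 1.2.3.4, bad 999.1.1.1'

candidates = ipv4_regex.findall(sample)
valid_ips = [item for item in candidates if is_valid_ipv4(item)]

print(candidates)
print(valid_ips)
```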

In [29]:
# Read all data

# Setup a list for storing the scraped IPs
scraped_ips = []

# Only take the first 2. Just making a point
for idx, link in enumerate(url_df.values[:2]):
    # Careful: the request is made after this print, so the byte count shown
    # belongs to the previous response, not to the link on the same line
    print(f'Link: {idx}: {link[0]} : bytes: {len(server_response.content)} ')
    server_response = requests.get(link[0])
    #print(f'\tReceived:  bytes')
    scraped_ips.append(ipv4_regex.findall(server_response.text))

Link: 0: http://rules.emergingthreats.net/blockrules/compromised-ips.txt : bytes: 1175993 
Link: 1: http://myip.ms/files/blacklist/csf/latest_blacklist.txt : bytes: 4165 
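
A slightly more defensive version of this loop might fetch first, report the size of the page it actually received, and skip links that fail. The sketch below is one way to do that, not the notebook's method: `scrape_ips` and `fake_fetch` are hypothetical names, and the fetch step is passed in as a function so the example runs without touching the network. With Requests, the fetch function could be something like `lambda link: requests.get(link, timeout=10).text`.

```python
import re

ipv4_regex = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')

def scrape_ips(links, fetch):
    """Fetch each link with the supplied function and collect the IPs found."""
    results = {}
    for link in links:
        try:
            text = fetch(link)
        except Exception as error:  # broad on purpose: any failure skips the link
            print(f'Skipping {link}: {error!r}')
            continue
        # Report the size of the page we just received
        print(f'{link}: {len(text)} characters')
        results[link] = ipv4_regex.findall(text)
    return results

# Stand-in for the network: made-up pages keyed by made-up URLs
fake_pages = {
    'https://example.com/list-a.txt': '10.0.111.1\n1.2.3.4\n',
    'https://example.com/list-b.txt': '# no addresses here\n',
}

def fake_fetch(link):
    return fake_pages[link]  # raises KeyError for an unknown link

found = scrape_ips(
    ['https://example.com/list-a.txt',
     'https://example.com/missing.txt',
     'https://example.com/list-b.txt'],
    fake_fetch,
)
print(found)
```

Keeping the results in a dictionary keyed by link also records which feed each IP came from.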


In [30]:
# Take a look at the scraped IPs.
# Looking at a subset in the interest of space
scraped_ips[0][:10]

['101.47.6.209',
 '103.10.55.198',
 '103.195.238.130',
 '103.90.84.153',
 '104.143.77.8',
 '104.248.172.102',
 '106.70.252.79',
 '107.148.174.118',
 '111.91.178.253',
 '112.168.205.145']

In [31]:
# How many IPs did we get?
len(scraped_ips[0])

289

In [32]:
# scraped_ips is a list of lists, with one inner list per link
# Taking a shortcut here: build the dataframe from the first list only
scraped_ip_df = pd.DataFrame(scraped_ips[0], columns=['raw_ip'])
scraped_ip_df.head(10)

Unnamed: 0,raw_ip
0,101.47.6.209
1,103.10.55.198
2,103.195.238.130
3,103.90.84.153
4,104.143.77.8
5,104.248.172.102
6,106.70.252.79
7,107.148.174.118
8,111.91.178.253
9,112.168.205.145
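
The dataframe above only holds the IPs from the first link. If the goal were to combine the IPs from every link into one flat list, `itertools.chain.from_iterable` is one option, sketched here on a small made-up list of lists.

```python
from itertools import chain

# Stand-in for scraped_ips: one inner list per link
scraped_ips_example = [
    ['101.47.6.209', '103.10.55.198'],
    ['104.143.77.8'],
]

# Flatten the inner lists into a single list, keeping the order
all_ips = list(chain.from_iterable(scraped_ips_example))

print(all_ips)
```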


In [33]:
# Confirm the number of IPs based on the dataframe
scraped_ip_df.shape

(289, 1)

In [34]:
# Let's say we had that IP 192.168.0.1
ip = '192.168.0.1'
ip

'192.168.0.1'

In [35]:
# What we want to do is split this into individual numbers
ip.split('.')

['192', '168', '0', '1']

In [36]:
# However, we need this as a number not a string
type(ip.split('.')[0])

str

In [37]:
# Getting this as an int
in_ip = [ int(item) for item in ip.split('.') ]
in_ip

[192, 168, 0, 1]

In [38]:
# We can also now confirm that these values are numbers
type(in_ip[0])

int

In [50]:
# We could use these IPs as is and treat them all as strings
# I don't think this is a good strategy, as we would still need to get them in as numbers
# Let's try splitting them instead and get each octet as a number
# str.split returns strings, so astype(int) converts each octet to an integer
scraped_ip_df[['oct_1', 'oct_2', 'oct_3', 'oct_4']] = scraped_ip_df['raw_ip'].str.split('.', expand=True).astype(int)
scraped_ip_df.head(n=10)

Unnamed: 0,raw_ip,oct_1,oct_2,oct_3,oct_4,combined_bits
0,101.47.6.209,101,47,6,209,01100101001011110000011011010001
1,103.10.55.198,103,10,55,198,01100111000010100011011111000110
2,103.195.238.130,103,195,238,130,01100111110000111110111010000010
3,103.90.84.153,103,90,84,153,01100111010110100101010010011001
4,104.143.77.8,104,143,77,8,01101000100011110100110100001000
5,104.248.172.102,104,248,172,102,01101000111110001010110001100110
6,106.70.252.79,106,70,252,79,01101010010001101111110001001111
7,107.148.174.118,107,148,174,118,01101011100101001010111001110110
8,111.91.178.253,111,91,178,253,01101111010110111011001011111101
9,112.168.205.145,112,168,205,145,01110000101010001100110110010001


In [40]:
# We can use the numbers above as is or scale them
# We might instead want to convert them to their binary representation
np.binary_repr(4, width=8)

'00000100'

In [41]:
# https://stackoverflow.com/questions/2733788/convert-ip-address-string-to-binary-in-python 

ip_addr = '192.168.0.1'
# The loop variable is named octet, since each item is one octet of the address
print(list(''.join([np.binary_repr(int(octet), width=8) for octet in ip_addr.split('.')])))

['1', '1', '0', '0', '0', '0', '0', '0', '1', '0', '1', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1']
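
The same 32-bit string can also be produced with the standard library alone, which may be handy where NumPy is not available. This sketch converts the address to a single integer with `ipaddress` and then formats it as 32 binary digits. The helper name `ip_to_bits` is made up for this example.

```python
import ipaddress

def ip_to_bits(ip_text):
    """Return the IPv4 address as a 32-character string of 0s and 1s."""
    as_int = int(ipaddress.IPv4Address(ip_text))
    # '032b' means: binary, zero-padded to 32 digits
    return format(as_int, '032b')

bit_string = ip_to_bits('192.168.0.1')

# One integer per bit, ready to be used as 32 features
bit_features = [int(bit) for bit in bit_string]

print(bit_string)
print(bit_features)
```

Because `IPv4Address` rejects anything that is not a real address, this route also validates the input along the way.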


In [51]:
# Add the columns as binary
scraped_ip_df['combined_bits'] = scraped_ip_df['raw_ip'].apply(lambda x: ''.join(np.binary_repr(int(i), width=8) for i in x.split('.')))
scraped_ip_df.head(n=10)

Unnamed: 0,raw_ip,oct_1,oct_2,oct_3,oct_4,combined_bits
0,101.47.6.209,101,47,6,209,01100101001011110000011011010001
1,103.10.55.198,103,10,55,198,01100111000010100011011111000110
2,103.195.238.130,103,195,238,130,01100111110000111110111010000010
3,103.90.84.153,103,90,84,153,01100111010110100101010010011001
4,104.143.77.8,104,143,77,8,01101000100011110100110100001000
5,104.248.172.102,104,248,172,102,01101000111110001010110001100110
6,106.70.252.79,106,70,252,79,01101010010001101111110001001111
7,107.148.174.118,107,148,174,118,01101011100101001010111001110110
8,111.91.178.253,111,91,178,253,01101111010110111011001011111101
9,112.168.205.145,112,168,205,145,01110000101010001100110110010001


In [52]:
# Wrap this up by splitting the bit string into a new dataframe
# Each bit is its own column
scraped_df_bits = scraped_ip_df['combined_bits'].str.join(sep=' ').str.split(pat=' ', expand=True).astype(int)
scraped_df_bits.head(n=10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,0,1,1,0,0,1,0,1,0,0,...,1,0,1,1,0,1,0,0,0,1
1,0,1,1,0,0,1,1,1,0,0,...,1,1,1,1,0,0,0,1,1,0
2,0,1,1,0,0,1,1,1,1,1,...,1,0,1,0,0,0,0,0,1,0
3,0,1,1,0,0,1,1,1,0,1,...,0,0,1,0,0,1,1,0,0,1
4,0,1,1,0,1,0,0,0,1,0,...,0,1,0,0,0,0,1,0,0,0
5,0,1,1,0,1,0,0,0,1,1,...,0,0,0,1,1,0,0,1,1,0
6,0,1,1,0,1,0,1,0,0,1,...,0,0,0,1,0,0,1,1,1,1
7,0,1,1,0,1,0,1,1,1,0,...,1,0,0,1,1,1,0,1,1,0
8,0,1,1,0,1,1,1,1,0,1,...,1,0,1,1,1,1,1,1,0,1
9,0,1,1,1,0,0,0,0,1,0,...,0,1,1,0,0,1,0,0,0,1


In [53]:
# Let's change the column names
scraped_df_bits.columns = [ f'bit_{i}' for i in range(32) ]
scraped_df_bits.head(n=10)

Unnamed: 0,bit_0,bit_1,bit_2,bit_3,bit_4,bit_5,bit_6,bit_7,bit_8,bit_9,...,bit_22,bit_23,bit_24,bit_25,bit_26,bit_27,bit_28,bit_29,bit_30,bit_31
0,0,1,1,0,0,1,0,1,0,0,...,1,0,1,1,0,1,0,0,0,1
1,0,1,1,0,0,1,1,1,0,0,...,1,1,1,1,0,0,0,1,1,0
2,0,1,1,0,0,1,1,1,1,1,...,1,0,1,0,0,0,0,0,1,0
3,0,1,1,0,0,1,1,1,0,1,...,0,0,1,0,0,1,1,0,0,1
4,0,1,1,0,1,0,0,0,1,0,...,0,1,0,0,0,0,1,0,0,0
5,0,1,1,0,1,0,0,0,1,1,...,0,0,0,1,1,0,0,1,1,0
6,0,1,1,0,1,0,1,0,0,1,...,0,0,0,1,0,0,1,1,1,1
7,0,1,1,0,1,0,1,1,1,0,...,1,0,0,1,1,1,0,1,1,0
8,0,1,1,0,1,1,1,1,0,1,...,1,0,1,1,1,1,1,1,0,1
9,0,1,1,1,0,0,0,0,1,0,...,0,1,1,0,0,1,0,0,0,1


In [54]:
# Let's put the two dataframes together
final_df = pd.concat([scraped_ip_df, scraped_df_bits], axis=1)
final_df.head(n=10)

Unnamed: 0,raw_ip,oct_1,oct_2,oct_3,oct_4,combined_bits,bit_0,bit_1,bit_2,bit_3,...,bit_22,bit_23,bit_24,bit_25,bit_26,bit_27,bit_28,bit_29,bit_30,bit_31
0,101.47.6.209,101,47,6,209,01100101001011110000011011010001,0,1,1,0,...,1,0,1,1,0,1,0,0,0,1
1,103.10.55.198,103,10,55,198,01100111000010100011011111000110,0,1,1,0,...,1,1,1,1,0,0,0,1,1,0
2,103.195.238.130,103,195,238,130,01100111110000111110111010000010,0,1,1,0,...,1,0,1,0,0,0,0,0,1,0
3,103.90.84.153,103,90,84,153,01100111010110100101010010011001,0,1,1,0,...,0,0,1,0,0,1,1,0,0,1
4,104.143.77.8,104,143,77,8,01101000100011110100110100001000,0,1,1,0,...,0,1,0,0,0,0,1,0,0,0
5,104.248.172.102,104,248,172,102,01101000111110001010110001100110,0,1,1,0,...,0,0,0,1,1,0,0,1,1,0
6,106.70.252.79,106,70,252,79,01101010010001101111110001001111,0,1,1,0,...,0,0,0,1,0,0,1,1,1,1
7,107.148.174.118,107,148,174,118,01101011100101001010111001110110,0,1,1,0,...,1,0,0,1,1,1,0,1,1,0
8,111.91.178.253,111,91,178,253,01101111010110111011001011111101,0,1,1,0,...,1,0,1,1,1,1,1,1,0,1
9,112.168.205.145,112,168,205,145,01110000101010001100110110010001,0,1,1,1,...,0,1,1,0,0,1,0,0,0,1


In [57]:
# Take all the columns from column 6 onwards
# These represent each of the bits
final_df.columns[6:]

Index(['bit_0', 'bit_1', 'bit_2', 'bit_3', 'bit_4', 'bit_5', 'bit_6', 'bit_7',
       'bit_8', 'bit_9', 'bit_10', 'bit_11', 'bit_12', 'bit_13', 'bit_14',
       'bit_15', 'bit_16', 'bit_17', 'bit_18', 'bit_19', 'bit_20', 'bit_21',
       'bit_22', 'bit_23', 'bit_24', 'bit_25', 'bit_26', 'bit_27', 'bit_28',
       'bit_29', 'bit_30', 'bit_31'],
      dtype='object')

In [59]:
# Create our X data to send to our supa dupa model ;-)
X = final_df.iloc[:, 6:].values
X

array([[0, 1, 1, ..., 0, 0, 1],
       [0, 1, 1, ..., 1, 1, 0],
       [0, 1, 1, ..., 0, 1, 0],
       ...,
       [0, 1, 0, ..., 1, 0, 1],
       [0, 1, 1, ..., 1, 0, 0],
       [0, 1, 1, ..., 0, 0, 1]])

In [60]:
# At this point, you can plug your data into your supa dupa cool algorithm
from sklearn.cluster import KMeans

In [61]:
# That's it for this session
my_supa_dupa_algo = KMeans().fit(X)
my_supa_dupa_algo