![](../images/logos/KIEPSKIES.jpg)

# <span style="color: #00008B;"> Data Acquisition and Preprocessing</span>
## <span style="color: #00008B;"> Data Acquisition</span>

Here are sources of data;

1. Public data sets like Kaggle 
2. Trusted Sources like government websites and research organizations 
3. Company data - e.g operational data, manual data entry, sensors data
4. Web APIs
5. Web scraping 
6. Databases 

I. <span style="color: #00008B;">**Public Sources**</span> 

Here is where you can find out free data for practice; 

* [Kaggle 🏆](https://www.kaggle.com/datasets)
* [Google Dataset Search 🔍](https://archive.ics.uci.edu/datasets/)
* [UCI Machine Learning Repository 🎓](https://datasetsearch.research.google.com/)

II. <span style="color: #00008B;">**Trusted Sources for Raw Data**</span> 

Most government, NGOs have websites provide data to the public for research purposes (weather, tides, census) and public verification(doctors and laywer verifications). Some of the websites are listed below; 

- [GSP Monitoring Data](https://data.cnra.ca.gov/dataset/gspmd) - Groundwater Sustainability Plan (GSP) Monitoring dataset contains the monitoring sites and associated groundwater level, subsidence or streamflow measurements collected by Groundwater Sustainability Agencies (GSA) during implementation of their GSP. All data is submitted to the Department of Water Resources (DWR) through the Sustainable Groundwater Management Act (SGMA) Portal’s Monitoring Network Module (MNM). Is provided by CALIFORNIA NATURAL RESOURCES AGENCY. 
- [Doc Info](https://www.docinfo.org/) - provides data for qualified doctors, their state(location) and level of education. 
- [The Kenya National Bureau of Statistics](https://www.knbs.or.ke/)
- [Kenya National Data Archive](https://statistics.knbs.or.ke/nada/index.php/catalog/179) (KeNADA)

III. <span style="color: #00008B;">**Company data**</span>

ompanies collect data from various internal sources to monitor operations, optimize performance, and support decision-making. These sources can be categorized into the following:

- Manual Data Entry - Data is manually inputted by employees, customers, or operators into spreadsheets, forms, or databases.Examples include Customer feedback forms,Employee attendance records and Sales logs entered by staff. **However this is prone to human error, time-consuming, and requires validation.**
- Sensor Data (IoT & Automated Systems) - Data iscollected from sensors, Internet of Things (IoT) devices, and automated tracking systems in real time. Examples include; GPS tracking for ships in the fishing industry, Temperature & salinity sensors for ocean monitoring, Machine performance & maintenance logs in aquaculture farms. **However, it requires proper storage, data processing, and system integration.**
- Company Operational Data - Data is generated through daily business activities, financial transactions, and resource management systems. Examples include; Inventory & supply chain data,Employee performance & payroll and Production & logistics reports. **However, it is often siloed across different departments, requiring data integration for full insights.**

IV. <span style="color: #00008B;">**Web APIs**</span>

APIs (Application Programming Interfaces) are the easiest and most ethical way to get companies' data from data scientist. They allow us to request real-time data from organizations.

Lets explore how to retrieve songs data from [Spotify](https://open.spotify.com/); 

The Spotify API Documentation can be accessed from [here](https://developer.spotify.com/documentation/web-api)

In [22]:
import requests  # Import the requests library for making HTTP requests

# Define Spotify API credentials (Replace with your actual credentials)
CLIENT_ID = '0d2cb3defcae451c8d935ff84090d752'
CLIENT_SECRET = '323e1049ddc944d5a46fcca7c8b124e8'

# URL for obtaining an access token
AUTH_URL = 'https://accounts.spotify.com/api/token'

# Send a POST request with credentials to get an access token
auth_response = requests.post(AUTH_URL, {
    'grant_type': 'client_credentials',  # Specify the authentication method
    'client_id': CLIENT_ID,  # Pass the Client ID
    'client_secret': CLIENT_SECRET,  # Pass the Client Secret
})

# Convert the response to JSON format
auth_response_data = auth_response.json()

# Extract and store the access token
access_token = auth_response_data.get('access_token')

# Print the access token (optional, for debugging purposes)
print("Access Token:", access_token)

Access Token: BQB0-fhkG0UclZylR5XCAXQ1nT0xx_8JwEUyu0I21EPsBrLd6o4PdXBf3oNZR7g6iXdTMkv6MRBDXHcA_iuKQAlttd0WLOsLL8VWluF87z_eGPnRDcmXvYFBIuMhOpa71ys_DYn2Qpo


In [3]:
# Create the authorization headers for API requests
headers = {
    'Authorization': 'Bearer {token}'.format(token=access_token)  # Add the access token to the header
}

In [33]:
# Base URL for all Spotify API endpoints
BASE_URL = 'https://api.spotify.com/v1/'

# Track ID for a specific song (Replace with any valid Spotify Track ID)
track_id = '2TpxZ7JUBn3uw46aR7qd6V'

# Send a GET request to fetch track details
r = requests.get(BASE_URL + 'tracks/' + track_id, headers=headers)

# Print the response (optional, for debugging)
# print(r.json())  # Converts response to JSON and prints the track details

What Happens?

- If the request is successful (status code 200), it returns track details like name, artist, album, and duration.
- If there’s an error, it might return 401 (Unauthorized) if the token is invalid or expired.

This request is a fundamental step in retrieving music metadata from Spotify’s API.

In [34]:
r = r.json()
r.keys()
# r

dict_keys(['album', 'artists', 'available_markets', 'disc_number', 'duration_ms', 'explicit', 'external_ids', 'external_urls', 'href', 'id', 'is_local', 'name', 'popularity', 'preview_url', 'track_number', 'type', 'uri'])

In [36]:
# Define the Artist ID (Example: Led Zeppelin)
artist_id = '36QJpDe2go2KgaRleHCDTp'

# Send a GET request to fetch all albums by the artist
r = requests.get(
    BASE_URL + 'artists/' + artist_id + '/albums',  # API endpoint for artist albums
    headers=headers,  # Authorization headers with the access token
    params={
        'include_groups': 'album',  # Retrieve only full-length albums
        'limit': 50  # Maximum number of albums to fetch per request
    }
)

# Convert the response to JSON format
d = r.json()

# Print the JSON response (optional, for debugging)
print(d.keys())

dict_keys(['href', 'limit', 'next', 'offset', 'previous', 'total', 'items'])


V. <span style="color: #00008B;">**Webscraping**</span>

Web scraping is the process of extracting data from websites using automated scripts. It is commonly used for:

- Gathering market intelligence (e.g., competitor prices, trends)
- Extracting data for research (e.g., financial, weather, or sports data)
- Collecting publicly available proxies, news, or social media content

To perform web scraping, we use Python libraries like:

- `requests` – To send HTTP requests and retrieve webpage content
- `BeautifulSoup` – To parse and extract structured data from HTML
- `pandas` – To store and manipulate the extracted data in a DataFrame

Here we will extract free proxies from  free-proxy-list.net and saves them in a pandas DataFrame. The following are key steps to be followed; 

i. Send an HTTP request to fetch the webpage's HTML content.
ii. Parse the HTML using BeautifulSoup to find the relevant data.
iii. Extract proxy details like IP address, port, country, and HTTPS support.
iv. Store the extracted data in a structured pandas DataFrame.

In [39]:
import requests  # For sending HTTP requests
from bs4 import BeautifulSoup  # For parsing HTML content
import pandas as pd  # For storing extracted data in a structured format

# Step 1: Define the target URL (Website with free proxy lists)
url = 'https://free-proxy-list.net/'

# Step 2: Send a GET request to fetch the webpage content
response = requests.get(url)

# Step 3: Parse the HTML content of the webpage using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Step 4: Locate the table containing proxy information
table = soup.find('table')  # Find the first table in the page
table_body = table.find('tbody')  # Extract the body of the table

# Step 5: Initialize lists to store extracted data
ip_address = []   # List for storing IP addresses
port = []         # List for storing port numbers
country = []      # List for storing country names
https_secured = []  # List for storing HTTPS support status

# Step 6: Loop through each row in the table and extract data
for tr in table_body.find_all('tr'):  # Iterate through each table row
    td_s = tr.find_all('td')  # Extract all columns in the row

    # Append extracted data to respective lists
    ip_address.append(td_s[0].text)   # First column - IP Address
    port.append(td_s[1].text)         # Second column - Port Number
    country.append(td_s[3].text)      # Fourth column - Country
    https_secured.append(td_s[6].text)  # Seventh column - HTTPS support (Yes/No)

# Step 7: Create a pandas DataFrame to store the scraped data
proxies_df = pd.DataFrame({
    'ip_address': ip_address,
    'port': port,
    'country': country,
    'https_secured': https_secured
})

# Step 8: Display the first 5 rows of the DataFrame
print(proxies_df.head())

       ip_address port        country https_secured
0    85.215.64.49   80        Germany            no
1     74.48.78.52   80  United States            no
2  50.223.246.237   80  United States            no
3    50.174.7.159   80  United States            no
4  41.207.187.178   80           Togo            no


In [None]:
V. <span style="color: #00008B;">**Databases**</span>

