# Unlocking the Power of Data: Building Your Own Unique and High-Quality Datasets for Machine Learning and Data Science

Four different ways to acquire your own dataset, complete with Python code demonstrations.
Have you ever wanted to work on a data science project, but couldn't find a dataset that suited your needs? Or perhaps you're looking to gain more experience in data collection and cleaning? In either case, building your own dataset can be a rewarding and informative experience. In this article, we'll cover four ways to acquire your own dataset, with Python code examples for each method, and then outline a few more.

###### 1. Web Scraping
Web scraping involves extracting data from websites by automating the process of sending HTTP requests and parsing HTML responses. This method can be used to extract data from online databases, news articles, social media platforms, and other web-based sources. For example, you might scrape data from online marketplaces to build a dataset of product prices and descriptions.
Here's an example of how to scrape data from a website using Python's requests and BeautifulSoup libraries:

In [6]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'

response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')

table = soup.find_all('table')[0]

data = pd.read_html(str(table))[0]
data.to_csv('country_population.csv', index=False)
print(data.head())


  Country / Area UN continental region[4] UN statistical subregion[4]  \
0          India                     Asia               Southern Asia   
1       China[a]                     Asia                Eastern Asia   
2  United States                 Americas            Northern America   
3      Indonesia                     Asia          South-eastern Asia   
4       Pakistan                     Asia               Southern Asia   

   Population (1 July 2022)  Population (1 July 2023)  Change  
0                1417173173                1428627663  +0.81%  
1                1425887337                1425671352  −0.02%  
2                 338289857                 339996564  +0.50%  
3                 275501339                 277534123  +0.74%  
4                 235824863                 240485658  +1.98%  


###### Detailed Code Explanation
First, we import the necessary libraries: requests, pandas, and BeautifulSoup. requests is a library for making HTTP requests in Python, pandas is a library for data manipulation and analysis, and BeautifulSoup is a library for parsing HTML and XML documents.
We then define the URL of the webpage we want to scrape using the url variable. In this example, we are scraping the Wikipedia page for a list of countries by population.
Next, we use the requests.get() function to retrieve the webpage content as a response object. We store this response object in the response variable.
We then create a BeautifulSoup object from the response content using the BeautifulSoup() function. We pass the response.content as the first argument and 'html.parser' as the second argument to specify that we want to parse an HTML document. We store this BeautifulSoup object in the soup variable.
Next, we use the soup.find_all() method to find all the tables in the HTML document. We pass 'table' as the argument to specify that we want to find all the tables. We then select the first table in the list using [0] and store it in the table variable.
We then use the pd.read_html() function to parse the HTML table and convert it to a Pandas dataframe. We pass str(table) as the argument so that the table's HTML is read as a string. Because read_html() returns a list of dataframes, we select the first one using [0] and store it in the data variable. Recent pandas releases warn when a literal HTML string is passed; wrapping it as io.StringIO(str(table)) avoids the warning.
We save the data dataframe to a CSV file named 'country_population.csv' in the current working directory using the to_csv() method. We set the index argument to False to avoid writing the row indices to the CSV file.
Finally, we print the first few rows of the data dataframe using the head() method to verify that the data was scraped and converted to a dataframe correctly.
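
To make the parsing step less of a black box, here is a minimal sketch of pulling rows out of a table with BeautifulSoup alone. The SAMPLE_HTML string, its made-up country names, and the table_to_records helper are assumptions for illustration and are not part of the example above; real pages usually need extra handling for merged cells and footnotes.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for a downloaded page (not real data).
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>Aland</td><td>1,000</td></tr>
  <tr><td>Borduria</td><td>2,500</td></tr>
</table>
"""


def table_to_records(html, table_index=0):
    """Return the rows of one HTML table as a list of dicts keyed by header text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all("table")[table_index]
    rows = table.find_all("tr")
    # Assumption: the first row holds the column headers.
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all(["th", "td"])]
    records = []
    for row in rows[1:]:
        values = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        if len(values) == len(headers):  # skip rows that do not line up with the headers
            records.append(dict(zip(headers, values)))
    return records


records = table_to_records(SAMPLE_HTML)
print(records)
```

The resulting list of dictionaries can be passed to pd.DataFrame(records) if you want a dataframe. Note that every value is still a string at this point, so numeric columns need cleaning before analysis.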

###### 2. Data Generation with "Faker"
Data generation involves creating synthetic datasets from scratch or modifying existing datasets to include additional data points. This method can be useful for testing models, augmenting existing datasets, or simulating scenarios that are difficult to observe in real life. For example, you might generate a dataset of fake customer transactions to test a fraud detection algorithm.
Here's an example of how to generate a dataset of random customer transactions using Python's random and faker libraries:

In [12]:
from random import randint, choice, random
from faker import Faker

faker = Faker()

transactions = []
for i in range(5):
    customer_id = randint(1000, 9999)
    timestamp = faker.date_time_between(start_date='-1y', end_date='now').strftime('%Y-%m-%d %H:%M:%S')
    product = choice(['widget', 'gizmo', 'thingamajig'])
    price = round(randint(1, 100) + random(), 2)
    transactions.append({'Customer ID': customer_id, 'Timestamp': timestamp, 'Product': product, 'Price': price})

for transaction in transactions:
    print(f"Customer ID: {transaction['Customer ID']}\nTimestamp: {transaction['Timestamp']}\nProduct: {transaction['Product']}\nPrice: {transaction['Price']}\n")


Customer ID: 6898
Timestamp: 2022-07-28 20:30:56
Product: widget
Price: 79.07

Customer ID: 6605
Timestamp: 2023-03-13 08:23:23
Product: widget
Price: 15.16

Customer ID: 9863
Timestamp: 2022-11-11 17:36:05
Product: widget
Price: 52.26

Customer ID: 3051
Timestamp: 2022-07-09 17:41:53
Product: gizmo
Price: 21.06

Customer ID: 7757
Timestamp: 2022-06-02 08:28:32
Product: widget
Price: 81.39



###### Detailed Code Explanation

In [47]:
from random import randint, choice, random
from faker import Faker

The first line imports three functions, randint, choice, and random, from the random module. The randint function generates a random integer between two given numbers (both endpoints included), choice returns a random element from a list, and random generates a random float that is at least 0 and less than 1. The second line imports the Faker class from the faker module, which is used to generate fake data.

In [None]:
faker = Faker()

This line creates an instance of the Faker class and assigns it to the variable faker. The Faker class provides methods for generating fake data in a variety of formats, including names, addresses, phone numbers, and more.

In [None]:
transactions = []

This line creates an empty list called transactions, which will be used to store the generated transaction data.

In [None]:
for i in range(5):

This line starts a for loop that will run five times, generating five transactions.

In [None]:
customer_id = randint(1000, 9999)

This line generates a random integer between 1000 and 9999 and assigns it to the variable customer_id.

In [None]:
timestamp = faker.date_time_between(start_date='-1y', end_date='now').strftime('%Y-%m-%d %H:%M:%S')

This line generates a random timestamp between one year ago and the present using the Faker instance's date_time_between method. The resulting timestamp is formatted as a string with the %Y-%m-%d %H:%M:%S format.

In [48]:
product = choice(['widget', 'gizmo', 'thingamajig'])

This line randomly selects one of three products, widget, gizmo, or thingamajig, and assigns it to the variable product.

In [50]:
price = round(randint(1, 100) + random(), 2)

This line generates a random price for the transaction by adding a random integer between 1 and 100 to a random float between 0 and 1, rounded to two decimal places, and assigns it to the variable price.

In [52]:
transactions.append({'Customer ID': customer_id, 'Timestamp': timestamp, 'Product': product, 'Price': price})

This line creates a dictionary containing the generated customer ID, timestamp, product, and price, and appends it to the transactions list.

In [53]:
for transaction in transactions:
    print(f"Customer ID: {transaction['Customer ID']}\nTimestamp: {transaction['Timestamp']}\nProduct: {transaction['Product']}\nPrice: {transaction['Price']}\n")

Customer ID: 6898
Timestamp: 2022-07-28 20:30:56
Product: widget
Price: 79.07

Customer ID: 6605
Timestamp: 2023-03-13 08:23:23
Product: widget
Price: 15.16

Customer ID: 9863
Timestamp: 2022-11-11 17:36:05
Product: widget
Price: 52.26

Customer ID: 3051
Timestamp: 2022-07-09 17:41:53
Product: gizmo
Price: 21.06

Customer ID: 7757
Timestamp: 2022-06-02 08:28:32
Product: widget
Price: 81.39



Finally, this code loops over the transactions list and prints out each transaction's customer ID, timestamp, product, and price in a formatted string with newlines between each item. This creates a user-friendly output of the generated data.
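
One thing the example above doesn't give you is repeatability: every run produces different transactions. Here's a rough sketch of a seeded version that uses only the standard library. The make_transactions helper, the seed value, and the fixed reference date are assumptions for illustration. Faker has its own seeding mechanism (Faker.seed()), which isn't shown here.

```python
import random
from datetime import datetime, timedelta

SECONDS_PER_YEAR = 365 * 24 * 3600


def make_transactions(n, seed=42, now=None):
    """Generate n fake transactions; the same seed always gives the same rows."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    now = now or datetime(2023, 1, 1)  # assumption: fixed reference date for repeatability
    rows = []
    for _ in range(n):
        # Pick a moment within the year before the reference date.
        offset = timedelta(seconds=rng.randint(0, SECONDS_PER_YEAR))
        rows.append({
            "Customer ID": rng.randint(1000, 9999),
            "Timestamp": (now - offset).strftime("%Y-%m-%d %H:%M:%S"),
            "Product": rng.choice(["widget", "gizmo", "thingamajig"]),
            "Price": round(rng.randint(1, 100) + rng.random(), 2),
        })
    return rows


first = make_transactions(5)
second = make_transactions(5)
print(first == second)  # same seed, same data
```

A repeatable generator is handy when you want to share a bug report or compare two models on exactly the same synthetic data.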

###### 3. Generating a Random Dataset Using NumPy and Pandas
Here's an example of how to generate a simple dataset using the NumPy library:

In [54]:
import numpy as np
import pandas as pd

# Generate random data
data = np.random.randn(100, 4)

# Create a pandas dataframe
df = pd.DataFrame(data, columns=['Column 1', 'Column 2', 'Column 3', 'Column 4'])

# Save the dataframe to a CSV file
df.to_csv('my_dataset.csv', index=False)

# Print the first few rows of the dataframe
print(df.head())

   Column 1  Column 2  Column 3  Column 4
0  0.084719  0.616583 -0.402154 -0.475159
1  0.448959 -2.281844 -0.024368 -1.499126
2  0.892911  0.135883 -0.237427  1.478178
3 -2.069098 -1.036908  0.598262  0.607105
4 -0.903280  1.061673 -0.750074  0.504612


###### Detailed Code Explanation

First, we import the NumPy and Pandas libraries using the import statement. NumPy is a library for numerical computing in Python, while Pandas is a library for data manipulation and analysis. We then generate a 100x4 NumPy array of random numbers using the np.random.randn() function. The randn() function generates an array of random numbers from the standard normal distribution.

Next, we create a Pandas dataframe from the NumPy array using the pd.DataFrame() function. We pass the NumPy array as the first argument, and a list of column names as the columns argument. In this example, we have named the columns 'Column 1', 'Column 2', 'Column 3', and 'Column 4'.

We save the Pandas dataframe to a CSV file using the to_csv() method. We pass the filename 'my_dataset.csv' as the first argument, and set the index argument to False to avoid writing the row indices to the CSV file.

Finally, we print the first few rows of the Pandas dataframe using the head() method, which returns the first 5 rows by default. This allows us to verify that the dataset was generated correctly and to get a glimpse of the data.
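
Four columns of standard normal noise are fine for a smoke test, but you can get something closer to a real table by giving each column its own distribution. The sketch below is one possible way to do that with NumPy's seeded Generator; the column names, ranges, and seed are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator so the dataset is the same on every run (seed value is arbitrary).
rng = np.random.default_rng(seed=0)
n_rows = 100

df = pd.DataFrame({
    "height_cm": rng.normal(loc=170, scale=10, size=n_rows),  # bell-shaped values
    "age": rng.integers(18, 65, size=n_rows),                 # whole numbers from 18 to 64
    "group": rng.choice(["a", "b", "c"], size=n_rows),        # categorical labels
    "score": rng.uniform(0, 1, size=n_rows),                  # evenly spread between 0 and 1
})

print(df.head())
```

Mixing numeric and categorical columns like this gives you something to practice encoding and scaling on before you touch real data.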

###### 4. Data Synthesis
The code below loads two heart disease datasets from the UCI Machine Learning Repository, combines them vertically, cleans and transforms the data, and scales the continuous variables. Finally, it prints the first 5 rows of the synthesized dataset.
The output will be a pandas DataFrame with 21 columns (the 11 original columns that remain, plus 10 dummy columns) and one row for every row in the two original datasets:

In [46]:
import pandas as pd
import numpy as np

# Load the two datasets
df1 = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data", header=None, na_values="?")
df2 = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data", header=None, na_values="?")

# Combine the two datasets vertically
df = pd.concat([df1, df2], axis=0)

# Rename columns
df.columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# Replace missing values with the median of each column
df.fillna(df.median(), inplace=True)

# Convert categorical variables to dummy variables
cp_dummies = pd.get_dummies(df["cp"], prefix="cp")
slope_dummies = pd.get_dummies(df["slope"], prefix="slope")
thal_dummies = pd.get_dummies(df["thal"], prefix="thal")
df = pd.concat([df, cp_dummies, slope_dummies, thal_dummies], axis=1)
df.drop(["cp", "slope", "thal"], axis=1, inplace=True)

# Scale continuous variables
continuous_vars = ["age", "trestbps", "chol", "thalach", "oldpeak"]
df[continuous_vars] = (df[continuous_vars] - df[continuous_vars].mean()) / df[continuous_vars].std()

# Show the first 5 rows of the synthesized dataset
print(df.head())


        age  sex  trestbps      chol  fbs  restecg   thalach  exang   oldpeak  \
0  1.302286  1.0  0.731945 -0.262961  1.0      2.0  0.233067    0.0  1.389362   
1  1.743088  1.0  1.584739  0.640984  0.0      2.0 -1.533539    1.0  0.640255   
2  1.743088  1.0 -0.689377 -0.331184  0.0      2.0 -0.650236    1.0  1.670278   
3 -1.562928  1.0 -0.120848  0.026983  0.0      0.0  1.789364    0.0  2.513023   
4 -1.122126  0.0 -0.120848 -0.757573  0.0      2.0  1.158433    0.0  0.546616   

    ca  ...  cp_1.0  cp_2.0  cp_3.0  cp_4.0  slope_1.0  slope_2.0  slope_3.0  \
0  0.0  ...       1       0       0       0          0          0          1   
1  3.0  ...       0       0       0       1          0          1          0   
2  2.0  ...       0       0       0       1          0          1          0   
3  0.0  ...       0       0       1       0          0          0          1   
4  0.0  ...       0       1       0       0          1          0          0   

   thal_3.0  thal_6.0  thal_7.0 

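###### Detailed Code Explanation

The walkthrough below covers a variant of the code above: it joins the two files with pd.merge() and then adds random noise to create synthetic rows.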

In [None]:
import pandas as pd
import numpy as np

These are the standard imports for working with data in Python. Pandas is a popular library for working with tabular data, and NumPy is a library for working with numerical data.

In [55]:
# Load the two datasets
df1 = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data", header=None, na_values="?")
df2 = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data", header=None, na_values="?")

These lines load two datasets from the UCI Machine Learning Repository into Pandas dataframes. Both datasets are related to heart disease diagnosis; they were collected at different sites (Cleveland and Hungary) but share the same 14-column layout. The header=None argument tells Pandas that the files do not contain column names, and na_values="?" specifies that missing data is represented by a question mark.
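
If you want to see what header=None and na_values="?" do without downloading anything, here is a small sketch that reads a made-up two-row string in the same comma-separated, header-less style. The values in raw are invented for illustration.

```python
from io import StringIO

import pandas as pd

# Two invented rows; the "?" marks a missing value, as in the UCI files.
raw = "63,1,145\n67,?,160\n"

df = pd.read_csv(StringIO(raw), header=None, na_values="?")

print(df)               # columns are labeled 0, 1, 2 because there is no header row
print(df.isna().sum())  # the "?" has become a missing value in column 1
```

Without na_values="?", the second column would be read as text, and numeric operations such as median() would not treat it as a number.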

In [56]:
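# Merge the datasets, using all 14 columns as the key
merged_df = pd.merge(df1, df2, on=list(range(14)), how="outer")

This line merges the two datasets into a single dataframe. Because the files have no header row, the columns are labeled 0 to 13, and the on parameter lists all 14 of them as the key. Setting how="outer" specifies that all rows from both datasets should be included in the merged dataframe. Keying on every column also keeps the result at 14 columns; merging on only the first 13 would split the last column into two suffixed copies, and the renaming step that follows would fail.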

In [None]:
# Rename the columns
merged_df.columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]

This line renames the columns in the merged dataframe to more descriptive names, using the attribute names from the datasets' documentation (the data files themselves contain no header).

In [None]:
# Create new synthetic data by adding random noise to the original dataset
noise = np.random.normal(0, 0.1, size=merged_df.shape)
synthetic_data = merged_df + noise

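These lines create a new synthetic dataset by adding random noise to the merged dataset. The np.random.normal function generates random numbers with a normal distribution, with a mean of 0 and a standard deviation of 0.1. The size parameter specifies the dimensions of the noise array, which is the same as the merged dataset. Note that the noise is added to every column, including categorical codes such as sex and cp and the num label, so those values are no longer clean integers in the synthetic rows.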

In [None]:
# Concatenate the original and synthetic datasets
concatenated_data = pd.concat([merged_df, synthetic_data])

This line concatenates the original merged dataset with the synthetic dataset, resulting in a larger dataset with twice as many rows.

In [None]:
# Save the concatenated dataset to a CSV file
concatenated_data.to_csv("heart_disease_data.csv", index=False)

This line saves the concatenated dataset to a CSV file named "heart_disease_data.csv", without including the index column.
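
If you'd rather keep categorical columns and labels intact, one option is to add noise only to the continuous columns. The sketch below shows the idea on a tiny invented frame; the values, the 5% noise scale, and the seed are assumptions for illustration, not recommendations.

```python
import numpy as np
import pandas as pd

# Tiny invented frame standing in for the heart disease data.
original = pd.DataFrame({
    "age": [63.0, 67.0, 41.0],
    "chol": [233.0, 286.0, 204.0],
    "sex": [1, 1, 0],
    "target": [0, 1, 0],
})
continuous = ["age", "chol"]

rng = np.random.default_rng(0)  # arbitrary seed for repeatability
synthetic = original.copy()

# Scale the noise to 5% of each column's standard deviation (an arbitrary choice).
noise_scale = 0.05 * original[continuous].std().to_numpy()
noise = rng.normal(0.0, noise_scale, size=(len(original), len(continuous)))
synthetic[continuous] = original[continuous] + noise

# Stack original and synthetic rows; ignore_index gives a clean 0..n-1 index.
combined = pd.concat([original, synthetic], ignore_index=True)
print(combined)
```

Scaling the noise per column keeps the perturbation proportional to each variable's spread, so a column measured in the hundreds isn't treated the same as one measured in single digits.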

Here are a few more ways you can go about creating your own data:

###### Data Annotation
First up, we have data annotation. This is where you take an existing dataset and add additional information to it, such as labels or annotations, to make it more useful for a specific task. It's kind of like giving your dataset a makeover - you're taking something that's already pretty good and making it even better.

###### Manual Data Collection
Next, we have manual data collection. This is where you go out and gather data yourself, whether it's through surveys, interviews, or observations. It can be a lot of work, but it can also be really rewarding, like a scavenger hunt where you're the one setting the rules.

###### Data Augmentation
Then there's data augmentation, which is like playing dress-up with your dataset. You take an existing dataset and generate new data by making small modifications to the existing data, such as flipping an image or rotating a 3D object. It's like giving your dataset a whole new wardrobe without having to go shopping.
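
As a quick taste of what that looks like, here's a minimal sketch that treats a small NumPy array as a stand-in for a grayscale image and creates flipped and rotated copies. The 2x3 array is made up for illustration; real images would be loaded with an imaging library.

```python
import numpy as np

# A tiny 2x3 array standing in for a grayscale image (invented values).
image = np.arange(6).reshape(2, 3)

flipped = np.fliplr(image)  # mirror left to right
rotated = np.rot90(image)   # rotate 90 degrees counterclockwise

# Same-shaped variants can be stacked into one batch alongside the original.
augmented = np.stack([image, flipped])

print(flipped)
print(rotated)
print(augmented.shape)
```

The rotated copy has a different shape (3x2), which is why it isn't stacked with the others here.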

###### Data Labeling
Data labeling is another method, which involves assigning labels or tags to data points so that machine learning algorithms can learn from them. It's like putting name tags on all your friends at a party so that you can introduce them to someone new.

###### Crowdsourcing
If you don't feel like doing all the work yourself, you can always try crowdsourcing. This involves getting a group of people to help you collect or annotate data. It's like having your own personal army of data collectors, but without all the training and equipment.

###### Data Fusion
Finally, we have data fusion, which is like a superhero team-up for datasets. You take multiple datasets and merge them together to create a more complete and comprehensive dataset. It's like the Avengers, but for data.
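
Here's a small sketch of what that team-up can look like in pandas, joining two invented tables on a shared key. The table contents and column names are made up for illustration.

```python
import pandas as pd

# Two invented tables that share a customer_id column.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Oslo", "Lima", "Pune"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 3],
    "amount": [20.0, 35.5, 12.0],
})

# A left join keeps every customer, even those without any orders.
fused = customers.merge(orders, on="customer_id", how="left")
print(fused)
```

Customer 2 has no orders, so their amount shows up as a missing value, which is exactly the kind of gap you'd then decide how to handle during cleaning.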

So, there you have it - a few different ways you can create your own datasets. But, hey, if you want to learn more about coding and data science, why not subscribe to my blog and leave a comment asking for more examples? I promise it'll be fun and educational, like going on a treasure hunt for knowledge. So, what are you waiting for? Let's get coding!

###### Question for readers: 
Have you ever created your own dataset? What method did you use and how did it turn out?