## Loading Data and Preprocessing
Here we load the CSV file and check the structure of the data to understand what we’re working with. We'll look at the first few rows, the column names, and any missing values.

In [13]:
import pandas as pd

# Load our dataset
df = pd.read_csv("Agora.csv", on_bad_lines='skip')

df.head()


Unnamed: 0,Vendor,Category,Item,Item Description,Price,Origin,Destination,Rating,Remarks
0,CheapPayTV,Services/Hacking,12 Month HuluPlus gift Code,12-Month HuluPlus Codes for $25. They are wort...,0.05027025666666667 BTC,Torland,,4.96/5,
1,CheapPayTV,Services/Hacking,Pay TV Sky UK Sky Germany HD TV and much mor...,Hi we offer a World Wide CCcam Service for En...,0.152419585 BTC,Torland,,4.96/5,
2,KryptykOG,Services/Hacking,OFFICIAL Account Creator Extreme 4.2,Tagged Submission Fix Bebo Submission Fix Adju...,0.007000000000000005 BTC,Torland,,4.93/5,
3,cyberzen,Services/Hacking,VPN > TOR > SOCK TUTORIAL,How to setup a VPN > TOR > SOCK super safe enc...,0.019016783532494728 BTC,,,4.89/5,
4,businessdude,Services/Hacking,Facebook hacking guide,. This guide will teach you how to hack Faceb...,0.062018073963963936 BTC,Torland,,4.88/5,


We loaded our dataset from the Agora.csv file using pandas. The dataset includes various listings from the Agora dark web marketplace, and we are focusing on structured columns like Category, Price, Rating, Origin, and Destination.
From this preview, we can see that some columns contain missing values, and the Price and Rating columns are in text format and will need cleaning.
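
The `on_bad_lines='skip'` argument used in the load step silently drops rows that the parser cannot fit into the header's column count. The sketch below illustrates that behaviour on a small in-memory CSV; the three-column layout and the row values are made up for illustration and are not taken from Agora.csv.

```python
import io
import pandas as pd

# Hypothetical three-column CSV; the second data row has one field too many.
raw = io.StringIO(
    "Vendor,Category,Price\n"
    "a,Services/Hacking,0.05 BTC\n"
    "b,Services/Hacking,0.15 BTC,unexpected\n"
    "c,Services/Hacking,0.01 BTC\n"
)

# The malformed row is dropped instead of raising a parser error.
sample = pd.read_csv(raw, on_bad_lines="skip")
print(len(sample))
print(sample["Vendor"].tolist())
```

Because skipped rows leave no trace, it may be worth loading once with `on_bad_lines="warn"` to see how many lines are affected before settling on `"skip"`.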

In [14]:
# Look at the column names and data types
df.info()

# Checking for missing values
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109689 entries, 0 to 109688
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Vendor             109689 non-null  object
 1    Category          109689 non-null  object
 2    Item              109685 non-null  object
 3    Item Description  109662 non-null  object
 4    Price             109684 non-null  object
 5    Origin            99807 non-null   object
 6    Destination       60528 non-null   object
 7    Rating            109674 non-null  object
 8    Remarks           12616 non-null   object
dtypes: object(9)
memory usage: 7.5+ MB


Vendor                   0
 Category                0
 Item                    4
 Item Description       27
 Price                   5
 Origin               9882
 Destination         49161
 Rating                 15
 Remarks             97073
dtype: int64

Our dataset contains 109,689 rows and 9 columns. Every column is stored as a string (object), including numeric fields like Price and Rating, so we still need to convert them. We can also see that there are some missing values, especially:

* Destination: ~45% missing

* Origin: ~9% missing

* Rating: only 15 missing

* Remarks: mostly empty, and we will not be using it, so we will ignore this field
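
The percentages in the list above follow directly from the missing-value counts and the total row count reported by `df.info()`. As a quick check, the sketch below recomputes them from those printed numbers; the variable names are illustrative only.

```python
import pandas as pd

# Counts copied from the df.isnull().sum() output above.
missing = pd.Series(
    {"Origin": 9882, "Destination": 49161, "Rating": 15, "Remarks": 97073}
)
total_rows = 109689  # from the RangeIndex line of df.info()

# Share of missing values per column, in percent.
missing_pct = (missing / total_rows * 100).round(1)
print(missing_pct)
```

On the loaded DataFrame itself, `df.isnull().mean() * 100` should give the same shares without copying the counts by hand.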

To prepare the data for machine learning, we will clean the data next. This includes removing rows with missing essential values (Category, Price, or Rating), converting Price and Rating to numeric format, and removing outliers (prices > 0.5 BTC). This will help us filter down to the relevant features.

In [17]:
# Strip leading/trailing whitespace from column names
df.columns = df.columns.str.strip()
# Drop rows with missing values in "Category", "Price", or "Rating"
df_clean = df.dropna(subset=["Category", "Price", "Rating"])

# Remove " BTC" and convert Price to float
df_clean["Price"] = df_clean["Price"].str.replace(" BTC", "", regex=False)
df_clean = df_clean[df_clean["Price"].str.match(r'^[\d\.]+$')]  # Keep only numeric values
df_clean["Price"] = df_clean["Price"].astype(float)

# Convert Rating to float
df_clean["Rating"] = df_clean["Rating"].str.extract(r'([\d\.]+)').astype(float)

# Drop extreme prices > 0.5 BTC
df_clean = df_clean[df_clean["Price"] <= 0.5]

# Look at the head of our cleaned data
df_clean[["Category", "Price", "Rating", "Origin", "Destination"]].head()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_clean["Price"] = df_clean["Price"].str.replace(" BTC", "", regex=False)


Unnamed: 0,Category,Price,Rating,Origin,Destination
0,Services/Hacking,0.05027,4.96,Torland,
1,Services/Hacking,0.15242,4.96,Torland,
2,Services/Hacking,0.007,4.93,Torland,
3,Services/Hacking,0.019017,4.89,,
4,Services/Hacking,0.062018,4.88,Torland,
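
The message printed before the table is pandas' `SettingWithCopyWarning`: `df_clean` came from `dropna` on `df`, so pandas cannot tell whether the later column assignments were meant to affect the original. The results above are still what we want, but one common way to avoid the warning is to take an explicit `.copy()` before assigning. The sketch below shows this on a made-up four-row frame (`toy` and its values are hypothetical). It also swaps the regex filter for `pd.to_numeric(..., errors="coerce")`, which is an alternative to the approach in the cell above, not the same code.

```python
import pandas as pd

# Hypothetical rows mimicking the raw text formats seen in the preview.
toy = pd.DataFrame({
    "Category": ["Services/Hacking", "Services/Hacking", None, "Services/Hacking"],
    "Price": ["0.05 BTC", "1.2 BTC", "0.01 BTC", "n/a"],
    "Rating": ["4.96/5", "4.90/5", "4.80/5", "4.70/5"],
})

# .copy() makes toy_clean an independent DataFrame, so the column
# assignments below do not trigger SettingWithCopyWarning.
toy_clean = toy.dropna(subset=["Category", "Price", "Rating"]).copy()

# Strip the unit, then turn anything that is not a valid number into NaN.
toy_clean["Price"] = pd.to_numeric(
    toy_clean["Price"].str.replace(" BTC", "", regex=False), errors="coerce"
)

# "4.96/5" -> 4.96: keep the first run of digits and dots.
toy_clean["Rating"] = (
    toy_clean["Rating"].str.extract(r"([\d.]+)", expand=False).astype(float)
)

# Keep prices up to 0.5 BTC; NaN prices compare as False and drop out too.
toy_clean = toy_clean[toy_clean["Price"] <= 0.5]
print(toy_clean)
```

Only the first toy row survives: the second exceeds 0.5 BTC, the third has no Category, and the fourth has a non-numeric price.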
