In [1]:
import pandas as pd
import random

# Function to generate a prefixed random string (used for code snippets, graphs, and opcodes)
def random_string(prefix, length=10):
  return f"{prefix}_" + ''.join(random.choices('abcdefghijklmnopqrstuvwxyz', k=length))

# Generate dataset with 3000 rows
data = {
  'Contract ID': range(1, 3001),  # IDs 1 through 3000 (the end of range is exclusive)
  'Code Snippet': [random_string('contract', 20) for _ in range(3000)],
  'Function Call Patterns': [f"{{call{random.randint(1, 50)}, call{random.randint(51, 100)}}}" for _ in range(3000)],
  'Control Flow Graph': [random_string('graph', 10) for _ in range(3000)],
  'Opcode Sequence': [random_string('opcodes', 15) for _ in range(3000)],
  'Label': [0] * 1500 + [1] * 1500  # 1500 secure, 1500 insecure
}

# Create DataFrame
df = pd.DataFrame(data)

# Shuffle the rows to mix secure and insecure contracts
df = df.sample(frac=1).reset_index(drop=True)

# Display the first 5 rows of the dataset (optional)
display(df.head())

# Save to CSV file
df.to_csv('smart_contract_dataset.csv', index=False)


| | Contract ID | Code Snippet | Function Call Patterns | Control Flow Graph | Opcode Sequence | Label |
|---|---|---|---|---|---|---|
| 0 | 321 | contract_kxcvnkkdehikzffrstdd | {call22, call91} | graph_hwustzmbve | opcodes_crjjzspsaggqsoh | 0 |
| 1 | 2831 | contract_inrtorokbfgtbmqqrfjk | {call25, call58} | graph_vaaulbmlrr | opcodes_uwnaspfahpzelcf | 1 |
| 2 | 1565 | contract_cassdopqvzbtfxnaawec | {call27, call94} | graph_ludlnykwbz | opcodes_girbhvbardiqbhf | 1 |
| 3 | 1863 | contract_bslpgainjtjpswfnslkv | {call17, call91} | graph_ulhotfijks | opcodes_dynbpiobgphbkaf | 1 |
| 4 | 1280 | contract_dokyhqhzzuhnfstfainq | {call3, call74} | graph_dedlfupxux | opcodes_fvvcwjixiwbejte | 0 |


### Explanation of the Code for Generating a Synthetic Smart Contract Dataset

This Python code, written for a Jupyter notebook, creates a synthetic dataset that mimics the layout of a smart contract dataset. Each column is named after a feature used in smart contract analysis, but the values are random placeholders rather than data derived from real contracts. Below, I'll break down the code step by step to explain its functionality in detail.

#### 1. **Importing Libraries**

```python
import pandas as pd
import random
```

- **`pandas` (`pd`)**: A widely used library for data manipulation and analysis. It provides data structures like DataFrames, which are ideal for handling tabular data.
- **`random`**: A module from the Python standard library that generates pseudo-random numbers and choices, which is helpful for creating synthetic data.

#### 2. **Defining a Function to Generate Random Strings**

```python
# Function to generate a prefixed random string (used for code snippets, graphs, and opcodes)
def random_string(prefix, length=10):
  return f"{prefix}_" + ''.join(random.choices('abcdefghijklmnopqrstuvwxyz', k=length))
```

- **`random_string` Function**: This function returns a given prefix, an underscore, and then a random string of a specified length.
  - **`prefix`**: The initial part of the string (e.g., 'contract', 'graph').
  - **`length`**: The length of the random part of the string (default is 10 characters). The prefix and underscore are not counted.
  - **`random.choices('abcdefghijklmnopqrstuvwxyz', k=length)`**: Creates a list of `length` random lowercase letters. Letters are drawn with replacement, so the same letter can appear more than once.
  - **`''.join(...)`**: Joins these characters into a single string.
  - **`f"{prefix}_"`**: Puts the prefix and an underscore at the beginning of the string.
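A quick way to check this behaviour is to seed the generator and inspect the result. This is a minimal sketch: the seed value and the variable names `sample` and `repeat` are assumptions for illustration and do not appear in the original code.

```python
import random

def random_string(prefix, length=10):
    # Same helper as above: prefix, underscore, then `length` random lowercase letters
    return f"{prefix}_" + ''.join(random.choices('abcdefghijklmnopqrstuvwxyz', k=length))

random.seed(42)  # assumed seed, used only to make the run repeatable
sample = random_string('graph', 10)

# Total length is the prefix (5), one underscore, and the random part (10)
print(len(sample))  # 16

# Re-seeding with the same value reproduces the same string
random.seed(42)
repeat = random_string('graph', 10)
print(sample == repeat)  # True
```

Seeding is optional. Without it, every call produces a different string, which is what the dataset generation relies on.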

#### 3. **Generating the Dataset**

```python
# Generate dataset with 3000 rows
data = {
  'Contract ID': range(1, 3001),  # IDs 1 through 3000 (the end of range is exclusive)
  'Code Snippet': [random_string('contract', 20) for _ in range(3000)],
  'Function Call Patterns': [f"{{call{random.randint(1, 50)}, call{random.randint(51, 100)}}}" for _ in range(3000)],
  'Control Flow Graph': [random_string('graph', 10) for _ in range(3000)],
  'Opcode Sequence': [random_string('opcodes', 15) for _ in range(3000)],
  'Label': [0] * 1500 + [1] * 1500  # 1500 secure, 1500 insecure
}
```

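- **`data` Dictionary**: Defines the columns of the dataset, each holding 3000 values.
  - **`'Contract ID'`**: A unique identifier for each contract, ranging from 1 to 3000. `range(1, 3001)` excludes its end value.
  - **`'Code Snippet'`**: The prefix `contract_` followed by 20 random lowercase letters (29 characters in total).
  - **`'Function Call Patterns'`**: Strings in the format `{callX, callY}`, where `X` is a random integer from 1 to 50 and `Y` is a random integer from 51 to 100, both ends inclusive. The doubled braces `{{` and `}}` in the f-string produce the literal `{` and `}`.
  - **`'Control Flow Graph'`**: The prefix `graph_` followed by 10 random lowercase letters (16 characters in total).
  - **`'Opcode Sequence'`**: The prefix `opcodes_` followed by 15 random lowercase letters (23 characters in total).
  - **`'Label'`**: A binary label where the first 1500 rows are labeled `0` (secure) and the next 1500 rows are labeled `1` (insecure). Labels are assigned by row position, not derived from the other columns, so the random feature columns carry no information about them. Because `'Contract ID'` follows the same row order, IDs 1 to 1500 always have label `0` and IDs 1501 to 3000 always have label `1`, even after shuffling.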
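The brace escaping in the `'Function Call Patterns'` column is easy to misread, so here is a small sketch that isolates it. The helper name `call_pattern` and the seed are hypothetical, introduced only for this example. The original code builds the string inline inside the list comprehension.

```python
import random

random.seed(0)  # assumed seed, used only to make the run repeatable

def call_pattern():
    # Hypothetical helper wrapping the inline f-string from the dataset code
    first = random.randint(1, 50)     # inclusive on both ends
    second = random.randint(51, 100)  # inclusive on both ends
    # Doubled braces {{ and }} produce literal { and } in an f-string
    return f"{{call{first}, call{second}}}"

patterns = [call_pattern() for _ in range(5)]
print(patterns)
```

Each entry has the shape `{callX, callY}`, matching the values shown in the sample output at the top of the notebook.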

#### 4. **Creating a DataFrame**

```python
# Create DataFrame
df = pd.DataFrame(data)
```

- **`pd.DataFrame(data)`**: Converts the `data` dictionary into a pandas DataFrame, which is a tabular data structure suitable for further analysis.
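To see what the resulting DataFrame looks like, the sketch below builds a much smaller one with the same construction. The size `n = 6` and the variable `summary` are assumptions chosen to keep the output readable. Only two of the columns are reproduced.

```python
import pandas as pd

n = 6  # assumed small size; the original uses 3000
data = {
    'Contract ID': range(1, n + 1),
    'Label': [0] * (n // 2) + [1] * (n // 2),
}
df = pd.DataFrame(data)

print(df.shape)  # (6, 2)

# Smallest and largest Contract ID for each label
summary = df.groupby('Label')['Contract ID'].agg(['min', 'max'])
print(summary)
```

The summary shows that label `0` covers IDs 1 to 3 and label `1` covers IDs 4 to 6. The same split by ID holds in the full dataset, so anyone training a model on it would probably want to exclude `'Contract ID'` from the features.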

#### 5. **Shuffling the Rows**

```python
# Shuffle the rows to mix secure and insecure contracts
df = df.sample(frac=1).reset_index(drop=True)
```

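- **`df.sample(frac=1)`**: Returns all rows of the DataFrame in random order. The `frac=1` argument samples 100% of the rows, and because sampling is without replacement by default, each row appears exactly once. No `random_state` is passed, so the order differs on every run.
- **`reset_index(drop=True)`**: Replaces the shuffled index with a fresh one running from 0 to 2999. `drop=True` discards the old index instead of inserting it as a new column.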
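If a repeatable shuffle is needed, `sample` accepts a `random_state` argument. The sketch below uses a six-row frame and the value `42`, both of which are assumptions for illustration. The original code does not fix a seed.

```python
import pandas as pd

df = pd.DataFrame({'Contract ID': range(1, 7), 'Label': [0, 0, 0, 1, 1, 1]})

# random_state=42 is an assumed value; any fixed integer gives a repeatable order
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
again = df.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled.equals(again))                                # True: same seed, same order
print(sorted(shuffled['Contract ID']) == list(range(1, 7)))  # True: no rows lost or duplicated
print(list(shuffled.index))                                  # [0, 1, 2, 3, 4, 5]
```

Note that `random_state` only controls the shuffle. The strings produced by the `random` module would need a separate `random.seed(...)` call to be repeatable.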

#### 6. **Displaying the First Few Rows (Optional)**

```python
# Display the first 5 rows of the dataset (optional)
display(df.head())
```

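- **`df.head()`**: Returns the first 5 rows of the DataFrame, and `display(...)` renders them as a table. `display` is provided by IPython and is available automatically in Jupyter notebooks, but it is not defined in a plain Python script. This step is optional but helps to verify the structure and content of the dataset.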
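When the same code has to run outside a notebook, one possible approach is to fall back to `print`. This is a sketch under that assumption, using a small stand-in DataFrame rather than the full dataset.

```python
import pandas as pd

df = pd.DataFrame({'Contract ID': [1, 2, 3], 'Label': [0, 1, 0]})

try:
    # IPython provides display(); the import fails if IPython is not installed
    from IPython.display import display
except ImportError:
    display = print  # plain-text fallback for ordinary scripts

display(df.head())
```

In a notebook this renders the usual HTML table. In a terminal it prints the text representation of the DataFrame.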

#### 7. **Saving the DataFrame to a CSV File**

```python
# Save to CSV file
df.to_csv('smart_contract_dataset.csv', index=False)
```

- **`df.to_csv('smart_contract_dataset.csv', index=False)`**: Saves the DataFrame to a CSV file named `smart_contract_dataset.csv`. The `index=False` argument prevents pandas from writing row indices into the CSV file.
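A simple round trip confirms that the file can be read back unchanged. The sketch below writes to a temporary directory so that it does not overwrite any existing file. The two-row DataFrame and the variable names `raw` and `loaded` are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({
    'Contract ID': [1, 2],
    'Function Call Patterns': ['{call3, call74}', '{call22, call91}'],
    'Label': [0, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'smart_contract_dataset.csv')
    df.to_csv(path, index=False)   # no index column is written
    with open(path) as f:
        raw = f.read()
    loaded = pd.read_csv(path)

# Fields that contain a comma are wrapped in double quotes
print(raw.splitlines()[1])  # 1,"{call3, call74}",0

# Column names and values survive the round trip
print(list(loaded.columns) == list(df.columns))  # True
print(loaded['Function Call Patterns'].tolist() == df['Function Call Patterns'].tolist())  # True
```

If `index=False` were omitted, the row index would be written as an extra first column with an empty header, and `pd.read_csv` would name that column `Unnamed: 0` when loading the file.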

### **Summary**

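This code snippet generates a synthetic dataset with the layout of a smart contract dataset. It fills the feature columns with random values and assigns the labels secure (`0`) and insecure (`1`) by row position, 1500 rows each. The dataset is then shuffled and saved to a CSV file, which can be used for testing or developing the data-handling parts of smart contract analysis tools.

- **Random Data Generation**: Random strings and patterns stand in for real smart contract data. They reproduce the column layout of such a dataset, but the values carry no meaning and have no relationship to the labels, so the data is suited to testing pipelines rather than to evaluating detection accuracy.
- **Shuffling and Saving**: Shuffling mixes the secure and insecure rows, and the CSV file makes the dataset easy to load in later analysis or modeling steps.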

By following this approach, we create a controlled environment for testing and experimentation with smart contract data, which is especially useful when real-world data is unavailable or confidential.