<a href="https://colab.research.google.com/github/Samir-atra/share-lm_dataset_analysis/blob/main/sharelm_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Analyzing the ShareLM dataset, looking for three insights:**


*   Show how user interest differs across model types.
*   Find the topics related to each model, with their counts and level of interest.
*   Check how often, and how repeatedly, models fail within a specific topic.


# **Loading and processing ShareLM dataset from Hugging Face**




In [2]:
#imports

import datasets
import os
from google import genai
import csv
import time
from transformers import AutoTokenizer
import torch # Import torch
from google.genai import types
import pandas as pd # Import pandas
import json # Import json for safer parsing


HF_token = os.environ.get('HF_TOKEN')
G_token = os.environ.get('GOOGLE_API_KEY')

ours = datasets.load_dataset("shachardon/ShareLM")["train"]
print(ours)




Dataset({
    features: ['conversation_id', 'conversation', 'model_name', 'user_id', 'timestamp', 'source', 'user_metadata', 'conversation_metadata'],
    num_rows: 349577
})


In [3]:
# counts and analysis

# Create a dictionary to store model counts
model_counts = {}
# Count of rows with valid model names
valid_model_count = 0

# Iterate over the model_name column once (avoids decoding every full row)
for model_name in ours["model_name"]:
    if model_name != "":  # Check if model name is not an empty string
        valid_model_count += 1
        model_counts[model_name] = model_counts.get(model_name, 0) + 1

# Print the count of rows with valid model names
print(f"Number of rows with valid model names: {valid_model_count}")

# for j in range(len(ours)):
#     if ours[j]["model_name"] == "":
#         print(ours[j])
#         print(j)
# Sort the model_counts dictionary by value in descending order
sorted_model_counts = dict(sorted(model_counts.items(), key=lambda item: item[1], reverse=True))

# Print the sorted model counts dictionary
print(sorted_model_counts)
print(ours[0])

Number of rows with valid model names: 10160
{'GPT-4': 256, 'https://lmarena.ai/': 122, 'ChatGPT': 82, 'gpt-4': 46, 'https://yuntian-deng-chatgpt4turbo.hf.space/?__theme=light': 45, 'gpt-4-code-interpreter': 33, 'https://claude.ai/chat/f6da33bf-631f-4943-ac08-d5f174ce3441': 28, 'https://claude.ai/chat/2964c05f-d02b-4c3c-ba78-8d9ea7b6e5ac': 27, 'https://yuntian-deng-o1.hf.space/?__theme=light': 25, 'https://chat.lmsys.org/': 21, 'https://tomoniai-mixtral-chat.hf.space/?__theme=light': 17, 'GPT-3.5': 15, 'https://chatgpt.com/': 12, 'https://huggingface.co/chat/conversation/66884ccbb687a6677d5a1df9': 10, 'https://huggingface.co/chat/conversation/66ec0e3fc0f4fe3b7986f1e7': 10, 'https://chatgpt.com/c/68827ed8-374c-8001-89f8-37f1e389818c': 10, 'gpt-4-browsing': 10, 'https://huggingfaceh4-starchat-playground.hf.space/?__theme=light': 8, 'https://open-orca-mistral-7b-openorca.hf.space/?__theme=light': 8, 'https://chatgpt.com/c/686e6c37-fd98-8001-be75-998616a07bc6': 8, 'https://chatgpt.com/c/68
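The counts above mix casing variants (`'GPT-4'` and `'gpt-4'`) with per-conversation URLs, so the same model can be split across many keys. The following is a minimal sketch of one possible clean-up, assuming that lowercasing plain names and collapsing URLs to their host is an acceptable grouping for a first pass. `normalize_model_name` and `count_models` are hypothetical helpers, not part of the notebook.

```python
from collections import Counter
from urllib.parse import urlparse


def normalize_model_name(name: str) -> str:
    """Collapse casing variants and per-conversation URLs into one key.

    Assumption: grouping URLs by host is good enough for a first pass.
    """
    name = name.strip()
    if name.startswith(("http://", "https://")):
        # Keep only the host, e.g. 'https://chatgpt.com/c/<id>' -> 'chatgpt.com'
        return urlparse(name).netloc.lower()
    return name.lower()


def count_models(names):
    """Count non-empty model names after normalization."""
    return Counter(normalize_model_name(n) for n in names if n)


sample = [
    "GPT-4",
    "gpt-4",
    "https://chatgpt.com/",
    "https://chatgpt.com/c/68827ed8-374c-8001-89f8-37f1e389818c",
    "",
]
normalized_counts = count_models(sample)
print(normalized_counts.most_common())
```

Grouping by host loses the distinction between different models served from the same site, so this is only suitable when the site itself is the unit of interest.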

In [4]:
# check for model name existence (the non-empty filter below is commented out, so this simply takes the first 5 rows)

valid_model_rows = []
for i in range(5):
    # if ours[i]["model_name"] != "":
    valid_model_rows.append(ours[i])

print(f"Number of rows with valid model names: {len(valid_model_rows)}")
# Optional: Print the first few rows with valid model names to inspect
if len(valid_model_rows) > 0:
    print("First 5 rows with valid model names:")
    for j in range(min(5, len(valid_model_rows))):
        print(valid_model_rows[j])

Number of rows with valid model names: 5
First 5 rows with valid model names:
{'conversation_id': '489874c1-ddab-458b-9d49-37e5b37c4c0c', 'conversation': [{'content': "Give the following data {'Name':['Tom', 'Brad', 'Kyle', 'Jerry'], 'Age':[20, 21, 19, 18], 'Height' : [6.1, 5.9, 6.0, 6.1]}. Can you plot one graph with two subplots as columns. The first is a bar graph showing the height of each person. The second is a bargraph showing the age of each person? Draw the graph in seaborn talk mode.\n  ", 'role': 'user'}, {'content': 'Sure! Here\'s how to create a visualization using Seaborn that has two subplots, where the left subplot shows the heights and the right subplot shows the ages:\n\nFirst, import the necessary libraries:\n\n\n\n\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n\n\nCreate a Pandas DataFrame from the given dictionary data:\n\n\n\n\ndata = {\'Name\': [\'Tom\', \'Brad\', \'Kyle\', \'Jerry\'],\n        \'Age\': [20, 21, 19, 18],\n        \

In [5]:

# Convert the Hugging Face dataset to a pandas DataFrame
df_ours = ours.to_pandas()

# Save the DataFrame to a CSV file
csv_file_path = "sharelm_dataset.csv"
df_ours.to_csv(csv_file_path, index=False)

print(f"Dataset successfully saved to {csv_file_path}")

Dataset successfully saved to sharelm_dataset.csv
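Nested columns such as `conversation_metadata` are written to CSV as text, so they come back as strings when the file is read again and have to be decoded before use. The sketch below illustrates that round trip with the standard library only, assuming the nested value is serialized with `json.dumps` before writing. The row content is made up for illustration.

```python
import csv
import io
import json

row = {
    "conversation_id": "example-id",  # made-up value for illustration
    "conversation_metadata": {"language": "", "title": "Bats are not blind."},
}

# Serialize the nested dict explicitly so it can be decoded unambiguously later.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow({**row, "conversation_metadata": json.dumps(row["conversation_metadata"])})

# Reading back: every CSV field is a string until it is decoded again.
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
raw_metadata = loaded["conversation_metadata"]
decoded_metadata = json.loads(raw_metadata)

print(type(raw_metadata).__name__, type(decoded_metadata).__name__)
```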


# **Display part of the dataset**

In [6]:

# Specify the path to the CSV file
csv_file_path = "sharelm_dataset_processing_progress.csv"

# Read the CSV file into a pandas DataFrame
df_with_topics = pd.read_csv(csv_file_path, low_memory=False)

pd.set_option('display.max_rows', None)  # Display all rows
pd.set_option('display.max_columns', None)  # Display all columns
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1500)  # Adjust width as needed
pd.set_option('display.colheader_justify', 'left')  # Optional: Align column headers to the left

# Display the conversation_metadata column for rows 1800-1899
print("conversation_metadata for rows 1800-1899 of the dataset with topics:")
display(df_with_topics[1800:1900]["conversation_metadata"])
os.path.exists("sharelm_dataset_processing_progress.csv")


conversation_metadata for rows 1800-1899 of the dataset with topics:


1800                                                                    {'rate': '', 'language': '', 'redacted': '', 'toxic': '', 'title': 'Komplexe Konzepte und Übersetzungen', 'custom_instruction': 'True', 'status': '', 'topic': 'analysis/decision'}
1801                                                                                        {'rate': '', 'language': '', 'redacted': '', 'toxic': '', 'title': 'Bats are not blind.', 'custom_instruction': 'False', 'status': '', 'topic': 'factual info'}
1802                                                                      {'rate': '', 'language': '', 'redacted': '', 'toxic': '', 'title': '`memmove` for Overlapping Memory', 'custom_instruction': 'False', 'status': '', 'topic': 'explanationcoding'}
1803                                                                         {'rate': '', 'language': '', 'redacted': '', 'toxic': '', 'title': 'Red Neuronal Simple Explicada', 'custom_instruction': 'False', 'status': '', 'topic': 'analysis/dec

True
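The metadata strings shown above are Python-style dict literals with a `topic` key once a row has been classified. As a minimal sketch, assuming that format, the topics can be tallied with `ast.literal_eval` and a `Counter`. `topic_of` is a hypothetical helper, and the sample rows are shortened versions of the output above.

```python
import ast
from collections import Counter


def topic_of(raw_metadata: str) -> str:
    """Return the 'topic' field of a metadata string, or '' if unavailable."""
    try:
        metadata = ast.literal_eval(raw_metadata)
    except (ValueError, SyntaxError):
        return ""
    if not isinstance(metadata, dict):
        return ""
    return metadata.get("topic", "")


rows = [
    "{'title': 'Bats are not blind.', 'topic': 'factual info'}",
    "{'title': '`memmove` for Overlapping Memory', 'topic': 'explanationcoding'}",
    "{'title': '', 'status': 'chosen'}",  # not classified yet
    "not a dict",
]
topic_counts = Counter(t for t in map(topic_of, rows) if t)
print(topic_counts)
```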

# **Download the dataset as CSV**

In [None]:
from google.colab import files

# Specify the path to the CSV file you saved
csv_file_path = "sharelm_dataset.csv"

try:
  files.download(csv_file_path)
  print(f"Initiated download for {csv_file_path}. Check your browser's downloads.")
except Exception as e:
  print(f"An error occurred during download: {e}")

# **The quota update function**

In [15]:
def check_and_update_quota(tokens_used):
    """
    Checks if performing an action would exceed quotas and updates usage.

    Args:
        tokens_used: The number of tokens the current action would use.

    Returns:
        True once the request has been counted (after sleeping, if needed, until the
        per-minute window resets); False only if the daily request limit has been reached.
    """
    global requests_this_minute, tokens_this_minute, requests_today
    global start_time_minute, start_time_day

    current_time = time.time()

    # Calculate time elapsed in the current minute and day
    time_elapsed_this_minute = current_time - start_time_minute
    time_elapsed_today = current_time - start_time_day

    # Reset minute counts if a minute has passed
    if time_elapsed_this_minute >= 60:
        requests_this_minute = 0
        tokens_this_minute = 0
        start_time_minute = current_time
        time_elapsed_this_minute = 0 # Reset elapsed time for the new minute

    # Reset daily counts if a day has passed (86400 seconds in a day)
    if time_elapsed_today >= 86400:
        requests_today = 0
        start_time_day = current_time
        time_elapsed_today = 0 # Reset elapsed time for the new day


    # Check if RPD limit is reached
    if requests_today >= RPD_LIMIT:
        print("RPD limit exceeded. Cannot make more requests today.")
        return False

    # Calculate time needed before the next request based on RPM and TPM
    # Ensure we don't divide by zero if limits are zero
    time_needed_rpm = 0
    if RPM_LIMIT > 0:
        # Calculate remaining capacity for requests in the current minute
        remaining_requests_in_minute = RPM_LIMIT - requests_this_minute - 1
        if remaining_requests_in_minute < 0:
            # If adding this request exceeds RPM, calculate time until next minute reset
            time_needed_rpm = 60 - time_elapsed_this_minute


    time_needed_tpm = 0
    if TPM_LIMIT > 0 and tokens_used > 0:
        # Calculate remaining capacity for tokens in the current minute
        remaining_tokens_in_minute = TPM_LIMIT - tokens_this_minute - tokens_used
        if remaining_tokens_in_minute < 0:
            # If adding these tokens exceeds TPM, calculate time until next minute reset
            time_needed_tpm = 60 - time_elapsed_this_minute

    # Determine the maximum time needed based on both limits and remaining time in the minute
    sleep_duration = max(time_needed_rpm, time_needed_tpm)

    # Ensure sleep duration is non-negative
    sleep_duration = max(0, sleep_duration)


    if sleep_duration > 0:
        print(f"Quota limit approaching. Sleeping for {sleep_duration:.2f} seconds to stay within limits.")
        time.sleep(sleep_duration)
        # After sleeping, update the current time and elapsed time for the minute
        current_time = time.time()
        start_time_minute = current_time # Reset start time for the new minute after sleeping
        time_elapsed_this_minute = 0
        requests_this_minute = 0 # Reset counts after sleeping for a new minute
        tokens_this_minute = 0


    # If within minute quotas and RPD, update usage
    requests_this_minute += 1
    tokens_this_minute += tokens_used
    requests_today += 1
    return True
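The waiting rule inside `check_and_update_quota` can be isolated as a pure function, which makes it easier to reason about and to test without sleeping. This is a sketch of that rule only, assuming the same fixed one-minute window. `seconds_until_allowed` and its parameter names are hypothetical and do not replace the function above.

```python
def seconds_until_allowed(requests_used, tokens_used, tokens_needed,
                          elapsed_in_window, rpm_limit, tpm_limit, window=60.0):
    """Return how long to wait before a request fits in the per-minute quotas.

    Mirrors the fixed-window idea: if either the request budget or the token
    budget would be exceeded, wait for the remainder of the current window.
    """
    if elapsed_in_window >= window:
        return 0.0  # the window has already rolled over, so counters reset
    over_requests = rpm_limit > 0 and requests_used + 1 > rpm_limit
    over_tokens = (tpm_limit > 0 and tokens_needed > 0
                   and tokens_used + tokens_needed > tpm_limit)
    if over_requests or over_tokens:
        return window - elapsed_in_window
    return 0.0


# Limits below are the RPM/TPM values used elsewhere in this notebook.
wait_ok = seconds_until_allowed(29, 1000, 500, 12.0, rpm_limit=30, tpm_limit=15000)
wait_rpm = seconds_until_allowed(30, 1000, 500, 12.0, rpm_limit=30, tpm_limit=15000)
wait_tpm = seconds_until_allowed(5, 14800, 500, 45.0, rpm_limit=30, tpm_limit=15000)
print(wait_ok, wait_rpm, wait_tpm)
```

One edge case to keep in mind: a single prompt whose token count is larger than `tpm_limit` can never fit in any window, so waiting alone will not help. Such rows would need to be truncated or skipped.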

## **Creating and adding a topic column to the dataset using Gemma models**

## Classifications to use in creating the "Topic" column in the dataset:

*   **WildChat:** [assisting/creative writing, analysis/decision, explanationcoding, factual info, math reason]
*   **WebDS: Attributes** [Multihop, Structured, Unstructured (text), Unstructured (nontext), Question-Answer, Multi-Website, Action-Based, Tool-usage], **Domains** [Demographics & Policy, E-Commerce & Forums, Economics/Markets, Energy & Climate, Health, Higher Education, Music, Scientific Research, Sports, (Tourism, Trade, Airlines)]

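Because the classifier is asked to answer with a single label, its raw reply may still carry extra punctuation, casing differences, or surrounding words. A minimal sketch of a post-processing step follows, assuming the five WildChat labels listed above are the only valid outputs. `normalize_label` and the `"unknown"` fallback are hypothetical and not part of the notebook.

```python
ALLOWED_LABELS = [
    "assisting/creative writing",
    "analysis/decision",
    "explanationcoding",
    "factual info",
    "math reason",
]


def normalize_label(response_text: str) -> str:
    """Map raw model output to one of ALLOWED_LABELS, or 'unknown'."""
    cleaned = response_text.strip().strip("\"'`*.").lower()
    for label in ALLOWED_LABELS:
        if cleaned == label:
            return label
    # Fall back to substring search for replies such as 'Label: factual info'.
    matches = [label for label in ALLOWED_LABELS if label in cleaned]
    return matches[0] if len(matches) == 1 else "unknown"


print(normalize_label("  Factual info\n"))
print(normalize_label("coding"))
```

Storing `"unknown"` instead of the raw reply keeps the topic column restricted to a closed set, which simplifies later counting.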



In [None]:
# Check if GPU is available and set the device accordingly
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

responses_generated = 1
# Quota limits provided by the user
RPM_LIMIT = 30      # Requests Per Minute
TPM_LIMIT = 15000   # Tokens Per Minute
RPD_LIMIT = 14400   # Requests Per Day

# Variables to track current usage
requests_this_minute = 0
tokens_this_minute = 0
requests_today = 0

# Timestamps to track time for rate limiting
start_time_minute = time.time()
start_time_day = time.time()

MANUAL_START_ROW = 1810
MANUAL_END_ROW = 5000

# File paths for saving progress (relative to the current working directory)
progress_csv_path = "sharelm_dataset_processing_progress.csv"
last_processed_index_path = "last_processed_index.txt"
original_dataset_csv_path = "sharelm_dataset.csv" # Path to the original saved dataset locally

# Load progress if it exists
start_index = MANUAL_START_ROW # Default start index
if os.path.exists(last_processed_index_path):
    try:
        with open(last_processed_index_path, 'r') as f:
            # Resume from the index AFTER the last successfully processed one
            last_processed = f.read().strip()
            if last_processed:
                start_index = int(last_processed) + 1
        print(f"Resuming processing from index: {start_index}")
    except (ValueError, IOError) as e:
        print(f"Could not load or parse last processed index: {e}. Starting from the beginning of the defined chunk.")
        start_index = MANUAL_START_ROW

# If progress file exists, we will append to it. If not, it will be created.
# We will load the dataset in chunks instead of all at once.

if not os.path.exists(original_dataset_csv_path):
    print("Original dataset CSV not found locally. Loading from Hugging Face and saving to CSV (this might take time).")
    ours = datasets.load_dataset("shachardon/ShareLM")["train"]
    df_ours = ours.to_pandas()
    df_ours.to_csv(original_dataset_csv_path, index=False)
    print(f"Dataset saved to {original_dataset_csv_path}")
    del ours # Free up memory
    del df_ours


client = genai.Client(api_key=os.environ.get('GOOGLE_API_KEY'))


chunk_size = 5000
# Write header only if the file is new or we are starting from the very beginning of the manual chunk.
header = not os.path.exists(progress_csv_path) or start_index == MANUAL_START_ROW

model_name = "gemma-3n-e2b-it" # Corrected model name based on traceback
hf_model_name = "google/gemma-3n-e2b-it"
# Initialize a tokenizer using the exact model name and move to the selected device
try:
    tokenizer = AutoTokenizer.from_pretrained(hf_model_name)
    # Tokenizer doesn't have a .to(device) method, but the underlying model might.
    # However, for simple tokenization, CPU is usually sufficient and fast.
    print(f"Loaded tokenizer for model: {hf_model_name}.")
except Exception as e:
    print(f"Could not load tokenizer for {hf_model_name}: {e}. Falling back to 'gpt2' tokenizer for demonstration.")
    tokenizer = AutoTokenizer.from_pretrained('gpt2')
    print("Loaded 'gpt2' tokenizer as a fallback.")

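# Convert 'conversation_metadata' to dictionary if it's a string
import ast  # Needed to parse Python-style dict strings such as "{'rate': ''}"

def parse_metadata(metadata):
    if isinstance(metadata, str):
        try:
            # Try JSON first, since the progress file is written with json.dumps
            return json.loads(metadata)
        except (json.JSONDecodeError, TypeError):
            pass
        try:
            # Fall back to a safe literal parse for single-quoted dict strings;
            # unlike replacing quotes, this survives apostrophes inside values
            return ast.literal_eval(metadata)
        except (ValueError, SyntaxError):
            # Return an empty dictionary if the string cannot be parsed
            return {}
    elif pd.isna(metadata):
        return {}
    return metadata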

# Use a chunksize for reading the CSV to manage memory
# Line 0 of the file is the header, so data row k sits on file line k + 1.
# Skipping lines 1..start_index makes the first row read the one with index start_index.
rows_to_skip = range(1, start_index + 1) if start_index > 0 else None

df_chunk_iter = pd.read_csv(
    progress_csv_path,
    chunksize=chunk_size,
    dtype={'conversation_metadata': 'object'},
    skiprows=rows_to_skip,
    nrows=chunk_size
)

print(f"Starting processing loop from index {start_index} up to {MANUAL_END_ROW}.")
# This flag will be used to break the outer loop
processing_stopped = False

for i, chunk_df in enumerate(df_chunk_iter):
    # Set the DataFrame index to match the absolute index in the original CSV
    # The first chunk's index starts at `start_index`
    chunk_df.index = range(start_index + i * chunk_size, start_index + i * chunk_size + len(chunk_df))

    # conversation_metadata = parse_metadata(chunk_df['conversation_metadata'][i+start_index])



    # Process rows starting from the last processed index within the defined chunk
    for index, row in chunk_df.iterrows():
        # chunk_df is already indexed by absolute row number, so read this row's own metadata
        conversation_metadata = parse_metadata(row['conversation_metadata'])
        if index >= MANUAL_END_ROW:
            print(f"Reached manual end row {MANUAL_END_ROW}. Stopping.")
            processing_stopped = True
            break # Stop if we've reached the end of our processing window

        # Access the 'conversation' column
        try:
            conversation = row["conversation"]
        except KeyError:
            # Fallback to index if column name is not found
            conversation = row.iloc[1] # Assuming 'conversation' is the second column


        contents = f"""Analyze the following conversation text and classify it as one of the following classes in the comma-separated list [assisting/creative writing, analysis/decision, explanationcoding, factual info, math reason].

        Return ONLY one word referring to the label.

        Conversation: {conversation}
        """
        # Use the tokenizer to get the exact token count
        estimated_tokens_for_prompt = len(tokenizer.encode(contents))
        print(f"Processing row {index}")
        # Check quota before making the API call
        if check_and_update_quota(estimated_tokens_for_prompt):
            max_retries = 5
            response = None # Initialize response to None
            for retry_count in range(max_retries):
                try:
                    response = client.models.generate_content(
                    model=model_name,
                    contents=contents,
                    )
                    break
                except Exception as e:
                    print(f"Original metadata for row {index}: {conversation_metadata}")
                    print(f"API call failed for row {index} (Attempt {retry_count + 1}/{max_retries}): {e}")
                    if retry_count < max_retries - 1:
                        sleep_time = 2 ** retry_count # Exponential backoff
                        print(f"Retrying in {sleep_time} seconds...")
                        time.sleep(sleep_time)
                    else:
                        print(f"Max retries reached for row {index}. Skipping.")
                        # print("this is chunck", type(chunk_df.at[index, 'conversation_metadata']))
                        # conversation_metadata = chunk_df.at[index, 'conversation_metadata']
                        conversation_metadata['topic'] = "Error during classification (Failed retries)"
                        chunk_df.at[index, 'conversation_metadata'] = conversation_metadata


            if response is not None and response.text:
                classified_topic = response.text.strip()
                print(f"Processed row {index}: {classified_topic}")
                
                # metadata = chunk_df.at[index, 'conversation_metadata']
                print(f"Original metadata for row {index}: {conversation_metadata}")

                conversation_metadata['topic'] = classified_topic
                chunk_df.at[index, 'conversation_metadata'] = conversation_metadata
                print(f"Updated metadata for row {index}: {conversation_metadata}")

        
        else:
            print(f"Stopping processing at index {index} due to quota limit.")
            processing_stopped = True
            break # Stop processing if quota is exceeded
    
    # --- Save Progress After Each Chunk ---
    # Determine which rows from the chunk were actually processed in this run
    # `index` will hold the last index processed or attempted in the inner loop
    last_processed_index_in_chunk = index
    if processing_stopped and not check_and_update_quota(0): # If stopped by quota, the last item failed
        last_processed_index_in_chunk = index - 1

    # Get the slice of the dataframe that was successfully processed
    # (copied so the column assignment below does not act on a view of chunk_df)
    processed_chunk_df = chunk_df.loc[chunk_df.index[0]:last_processed_index_in_chunk].copy()

    if not processed_chunk_df.empty:
        # Convert metadata back to string for CSV storage
        processed_chunk_df['conversation_metadata'] = processed_chunk_df['conversation_metadata'].apply(json.dumps)
        processed_chunk_df.to_csv(progress_csv_path, mode='a', header=header, index=False)
        header = False # Header is written, don't write it again
        with open(last_processed_index_path, 'w') as f:
            f.write(str(last_processed_index_in_chunk))
        print(f"Saved processed chunk up to index {last_processed_index_in_chunk}")

    if processing_stopped:
        break # Break the outer loop if processing was stopped

print("\nProcessing finished.")

# The final processed data is in the progress CSV. You can load it to inspect.
if os.path.exists(progress_csv_path):
    print("\nDisplaying first 5 rows from the progress file:")
    df_processed = pd.read_csv(progress_csv_path)
    display(df_processed.head())
else:
    print("No processing was done or progress file was not created.")

Using device: cpu
Resuming processing from index: 1811
Loaded tokenizer for model: google/gemma-3n-e2b-it.
Starting processing loop from index 1811 up to 5000.
Processing row 1811
Quota limit approaching. Sleeping for 57.98 seconds to stay within limits.
Original metadata for row 1811: {'rate': '', 'language': '', 'redacted': '', 'toxic': '', 'title': '', 'custom_instruction': '', 'status': 'chosen'}
API call failed for row 1811 (Attempt 1/5): 429 RESOURCE_EXHAUSTED. {'error': {'code': 429, 'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits.', 'status': 'RESOURCE_EXHAUSTED', 'details': [{'@type': 'type.googleapis.com/google.rpc.QuotaFailure', 'violations': [{'quotaMetric': 'generativelanguage.googleapis.com/generate_content_free_tier_input_token_count', 'quotaId': 'GenerateContentInputTokensPerModelPerMinute-FreeTier', 'quotaDimensions': {'location': 'glo


*   Restructure the pipeline to classify topics in batches of a thousand rows and create a new version of the dataset.
*   Optimize the quota delay.
*   Create a file that records how many rows have been processed, then resume from there automatically with Gemini, or loop over the rows that are still missing.
*   Use TensorFlow to build the pipeline instead of pandas and other CPU-bound libraries (not possible).
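The note about recording progress and resuming automatically can be sketched as a small checkpoint helper. This is one possible approach, assuming a single text file that holds the last processed row index. `save_checkpoint` and `load_start_index` are hypothetical names, and the write goes through a temporary file so an interrupted run does not leave a half-written checkpoint.

```python
import os
import tempfile


def save_checkpoint(path: str, last_index: int) -> None:
    """Atomically record the last processed row index."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(str(last_index))
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_start_index(path: str, default_start: int) -> int:
    """Return the index to resume from, or default_start if no usable checkpoint."""
    try:
        with open(path) as handle:
            return int(handle.read().strip()) + 1
    except (OSError, ValueError):
        return default_start


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "last_processed_index.txt")
    first_start = load_start_index(checkpoint, default_start=1810)
    save_checkpoint(checkpoint, 1810)
    resumed_start = load_start_index(checkpoint, default_start=1810)
    print(first_start, resumed_start)
```

Saving the checkpoint after every row rather than after every chunk trades a little I/O for never repeating an API call after a crash.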



# Task
Analyze the dataset "dataset.jsonl" to understand the distribution of models used, languages, user contributions, and conversation lengths. Create the following visualizations:
1. A horizontal bar chart showing the top 20 most frequent models, with a subplot of a scatter plot showing individual model counts. Display "N/A" as the most used model if applicable, and print the name of the most used model separately.
2. A horizontal bar chart showing the frequency of models with names (excluding the most used model), with a subplot of a scatter plot showing individual model counts.
3. A horizontal bar chart showing the frequency of languages, with a subplot of a scatter plot showing individual language counts.
4. A horizontal bar chart showing the top users by contribution count, with a subplot of a scatter plot showing individual user contribution counts.
5. A horizontal histogram showing the distribution of conversation lengths, with increased scale numbers in the first thousand and specific numbers written on each bar, and a subplot of a scatter plot showing individual conversation lengths.
6. A more detailed horizontal histogram showing the distribution of conversation lengths between 0 and 1000, with numbers written on each bar, and a subplot of a scatter plot showing individual conversation lengths.

## Refine subplot layout and axes

### Subtask:
Review and adjust the layout and axes of all subplots to ensure they are clearly presented and aligned with their corresponding histograms.


**Reasoning**:
Examine the plots generated in the previous steps to assess the alignment and clarity of the subplots and their axes. Based on this assessment, the axes limits of the scatter plots need to be adjusted to align better with the corresponding horizontal bar charts and histograms.



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Count how often each model name occurs (returns a pandas Series)
model_counts_series = df_ours['model_name'].value_counts()

# Sort the series by count for better visualization and select the top 20
model_counts_series = model_counts_series.sort_values(ascending=False).head(20)

# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(18, 8))

# First subplot: Horizontal bar chart of the top 20 model counts
model_counts_series.plot(kind='barh', ax=axes[0])
axes[0].set_title('Top 20 Most Frequent Models in ShareLM Dataset (Bar Chart)')
axes[0].set_xlabel('Count')
axes[0].set_ylabel('Model Name')
axes[0].set_ylim(-0.5, len(model_counts_series) - 0.5) # Adjust y-axis limits for bar chart

# Add the most used model name as text annotation in the bar chart
most_used_model_name = model_counts_series.index[0]
most_used_model_count = model_counts_series.iloc[0]
axes[0].text(most_used_model_count + 1, 0, f'{most_used_model_name} ({most_used_model_count})', va='center', ha='left', fontsize=10, color='black')

# Second subplot: Scatter plot of individual model counts
axes[1].scatter(model_counts_series.values, range(len(model_counts_series)))
axes[1].set_title('Individual Model Counts (Scatter Plot)')
axes[1].set_xlabel('Count')
axes[1].set_ylabel('Model Name')
axes[1].set_yticks(range(len(model_counts_series)))
axes[1].set_yticklabels(model_counts_series.index)
axes[1].set_ylim(-0.5, len(model_counts_series) - 0.5) # Same y-axis limits and direction as the bar chart (first entry at the bottom); a preceding invert_yaxis() would be overridden by this call
axes[1].set_xlim(axes[0].get_xlim()) # Match x-axis limits with bar chart


plt.tight_layout()
plt.show()

# Print the name of the most used model separately
print(f"Most used model: {most_used_model_name}")

**Reasoning**:
Continue adjusting the layout and axes for the filtered model counts plot to ensure the scatter plot aligns with the bar chart.



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Recount all model names; the series from the previous cell was already cut to its top 20
all_model_counts_series = df_ours['model_name'].value_counts()

# Filter out the most used model and models with empty names
filtered_model_counts_series = all_model_counts_series.iloc[1:] # Exclude the first (most used) model
filtered_model_counts_series = filtered_model_counts_series[filtered_model_counts_series.index != ''] # Exclude empty names

# Select the top 20 from the filtered list
filtered_model_counts_series = filtered_model_counts_series.head(20)

# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(18, 8))

# First subplot: Horizontal bar chart of the filtered top 20 model counts
filtered_model_counts_series.plot(kind='barh', ax=axes[0])
axes[0].set_title('Top 20 Most Frequent Models (Excluding Most Used and Empty Names) (Bar Chart)')
axes[0].set_xlabel('Count')
axes[0].set_ylabel('Model Name')
axes[0].set_ylim(-0.5, len(filtered_model_counts_series) - 0.5) # Adjust y-axis limits for bar chart

# Second subplot: Scatter plot of individual filtered model counts
axes[1].scatter(filtered_model_counts_series.values, range(len(filtered_model_counts_series)))
axes[1].set_title('Individual Filtered Model Counts (Scatter Plot)')
axes[1].set_xlabel('Count')
axes[1].set_ylabel('Model Name')
axes[1].set_yticks(range(len(filtered_model_counts_series)))
axes[1].set_yticklabels(filtered_model_counts_series.index)
axes[1].set_ylim(-0.5, len(filtered_model_counts_series) - 0.5) # Same y-axis limits and direction as the bar chart (first entry at the bottom); a preceding invert_yaxis() would be overridden by this call
axes[1].set_xlim(axes[0].get_xlim()) # Match x-axis limits with bar chart


plt.tight_layout()
plt.show()

**Reasoning**:
Adjust the layout and axes for the language counts plot to ensure the scatter plot aligns with the bar chart.



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Count language occurrences
language_counts_series = df_ours['conversation_metadata'].apply(lambda x: x.get('language', 'None')).value_counts()


# Sort the series by count for better visualization and select the top 10
language_counts_series = language_counts_series.sort_values(ascending=False).head(10)

# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(18, 6))

# First subplot: Horizontal bar chart of the top 10 language counts
language_counts_series.plot(kind='barh', ax=axes[0])
axes[0].set_title('Top 10 Most Frequent Languages in ShareLM Dataset (Bar Chart)')
axes[0].set_xlabel('Count')
axes[0].set_ylabel('Language')
axes[0].set_ylim(-0.5, len(language_counts_series) - 0.5) # Adjust y-axis limits for bar chart

# Second subplot: Scatter plot of individual language counts
axes[1].scatter(language_counts_series.values, range(len(language_counts_series)))
axes[1].set_title('Individual Language Counts (Scatter Plot)')
axes[1].set_xlabel('Count')
axes[1].set_ylabel('Language')
axes[1].set_yticks(range(len(language_counts_series)))
axes[1].set_yticklabels(language_counts_series.index)
axes[1].set_ylim(-0.5, len(language_counts_series) - 0.5) # Same y-axis limits and direction as the bar chart (first entry at the bottom); a preceding invert_yaxis() would be overridden by this call
axes[1].set_xlim(axes[0].get_xlim()) # Match x-axis limits with bar chart


plt.tight_layout()
plt.show()

**Reasoning**:
Adjust the layout and axes for the user contributions plot to ensure the scatter plot aligns with the bar chart.



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Count user contributions
user_counts_series = df_ours['user_id'].value_counts()

# Select the top 20 user contributions
user_counts_series = user_counts_series.sort_values(ascending=False).head(20)

# Create a figure with two subplots arranged in a single row
fig, axes = plt.subplots(1, 2, figsize=(18, 8))

# First subplot: Horizontal bar chart of the top 20 user contributions
user_counts_series.plot(kind='barh', ax=axes[0])
axes[0].set_title('Top 20 Most Frequent User IDs in ShareLM Dataset (Bar Chart)')
axes[0].set_xlabel('Number of Contributions')
axes[0].set_ylabel('User ID')
axes[0].set_ylim(-0.5, len(user_counts_series) - 0.5) # Adjust y-axis limits for bar chart

# Second subplot: Scatter plot of individual user contribution counts
axes[1].scatter(user_counts_series.values, range(len(user_counts_series)))
axes[1].set_title('Individual User Contribution Counts (Scatter Plot)')
axes[1].set_xlabel('Number of Contributions')
axes[1].set_ylabel('User ID')

# Set y-axis tick locations and labels for the scatter plot to display User IDs
axes[1].set_yticks(range(len(user_counts_series)))
axes[1].set_yticklabels(user_counts_series.index)

# Use the same y-axis limits and direction as the horizontal bar chart (first entry at the bottom);
# a preceding invert_yaxis() would be overridden by this call
axes[1].set_ylim(-0.5, len(user_counts_series) - 0.5)
axes[1].set_xlim(axes[0].get_xlim()) # Match x-axis limits with bar chart


# Adjust subplot parameters for a tight layout
plt.tight_layout()

# Display the figure
plt.show()

**Reasoning**:
Adjust the layout and axes for the overall conversation length histogram to ensure the scatter plot aligns with the histogram bins.



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Calculate conversation lengths
conversation_lengths = df_ours['conversation'].apply(lambda x: len(x))


# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(18, 6))

# First subplot: Horizontal histogram of conversation lengths
n, bins, patches = axes[0].hist(conversation_lengths, bins=[i for i in range(0, 1001, 50)] + [max(conversation_lengths)], orientation='horizontal') # Increased scale for first 1000, then one large bin, added orientation
axes[0].set_title('Distribution of Conversation Lengths in ShareLM Dataset (Histogram)')
axes[0].set_xlabel('Frequency') # Swapped labels
axes[0].set_ylabel('Conversation Length (Number of turns)') # Swapped labels
axes[0].set_ylim(min(bins), max(bins)) # Set y-axis limits to match histogram bins

# Add text labels on each bar in the histogram
for patch in patches:
    x, y = patch.get_xy()
    width = patch.get_width()
    height = patch.get_height()
    if width > 0: # Only label bars with frequency > 0
        axes[0].text(x + width, y + height/2, int(width), va='center', ha='left') # Adjusted text position for horizontal bars

# Second subplot: Horizontal Scatter plot of individual conversation lengths
axes[1].scatter(conversation_lengths, range(len(conversation_lengths)), alpha=0.5) # Swapped x and y for horizontal scatter
axes[1].set_title('Individual Conversation Lengths (Scatter Plot)')
axes[1].set_xlabel('Conversation Length (Number of turns)') # Swapped labels
axes[1].set_ylabel('Conversation Index') # Swapped labels
# Since the scatter plot y-axis represents index, aligning it directly with histogram bins is not straightforward.
# We will match the y-axis range to the total number of conversations for now.
axes[1].set_ylim(0, len(conversation_lengths))
axes[1].set_xlim(axes[0].get_ylim()) # Match x-axis limits with histogram y-axis (conversation length)


plt.tight_layout()
plt.show()

**Reasoning**:
Adjust the layout and axes for the detailed conversation length histogram (0-1000 turns) to ensure the scatter plot aligns with the histogram bins and the relevant conversation lengths.



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Filter conversation lengths to include only those between 0 and 1000
short_conversation_lengths = df_ours['conversation'].apply(lambda x: len(x)).loc[lambda x: (x >= 0) & (x <= 1000)].tolist()


# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(18, 6))

# First subplot: More detailed horizontal histogram for short conversation lengths
n, bins, patches = axes[0].hist(short_conversation_lengths, bins=100, orientation='horizontal') # Increased number of bins for more detail, added orientation
axes[0].set_title('Distribution of Conversation Lengths (0-1000 turns) in ShareLM Dataset (Histogram)')
axes[0].set_xlabel('Frequency') # Swapped labels
axes[0].set_ylabel('Conversation Length (Number of turns)') # Swapped labels
axes[0].set_ylim(0, 1000) # Set y-axis limits to match the 0-1000 range

# Add text labels on each bar (optional, depending on how crowded it gets)
for patch in patches:
    x, y = patch.get_xy()
    width = patch.get_width()
    height = patch.get_height()
    if width > 0:
        axes[0].text(x + width, y + height/2, int(width), va='center', ha='left', fontsize=8)

# Second subplot: Horizontal Scatter plot of individual short conversation lengths
axes[1].scatter(short_conversation_lengths, range(len(short_conversation_lengths)), alpha=0.5) # Swapped x and y for horizontal scatter
axes[1].set_title('Individual Short Conversation Lengths (0-1000 turns) (Scatter Plot)')
axes[1].set_xlabel('Conversation Length (Number of turns)') # Swapped labels
axes[1].set_ylabel('Conversation Index (within 0-1000 range)') # Swapped labels
# Since the scatter plot y-axis represents index, aligning it directly with histogram bins is not straightforward.
# We will set y-axis limits based on the number of short conversations.
axes[1].set_ylim(0, len(short_conversation_lengths))
axes[1].set_xlim(0, 1000) # Set x-axis limits to match the 0-1000 range of the histogram


plt.tight_layout()
plt.show()

In the summary below, "N/A" stands for a model name that is unknown or was left as an empty string.

## **Summary:**

### **Data Analysis Key Findings**

*   The most frequent model label in the dataset is "N/A": these conversations have no recorded model name because they were collected from other datasets rather than through the plugin.
*   The top 20 most frequent models include "N/A" and several named models, with counts decreasing sharply after the top few.
*   Excluding "N/A" and empty model names yields a different top 20 that shows the distribution among named models. The most used are GPT models, with the latest versions favored.
*   The dataset contains conversations in multiple languages, heavily concentrated in the top 10. Language documentation is very limited, and English is dominant.
*   User contributions are highly skewed: a few users contribute a large number of conversations, while many users contribute far fewer.
*   Conversation lengths vary widely, with a large number of short conversations (0-1000 turns) and a long tail of much longer conversations.
*   A more detailed view of conversations between 0 and 1000 turns shows the specific distribution within this range.
*   Around 10,000 conversations were collected through the plugin; the rest of the dataset, about 300,000 conversations, comes from other datasets. As a result, conversation metadata is available only for the plugin data.
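
The findings above can be backed by a few headline numbers. The following is a minimal sketch, assuming a pandas DataFrame with the `model_name`, `user_id`, and `conversation` columns of the dataset; the helper name `summarize_findings` and the toy data are illustrative and not part of the original analysis.

```python
import pandas as pd


def summarize_findings(df: pd.DataFrame, top_k: int = 10) -> dict:
    """Return a few headline numbers for a ShareLM-style DataFrame."""
    # Treat missing and empty model names alike as "N/A".
    model = df["model_name"].fillna("").astype(str).str.strip()
    na_share = float((model == "").mean())

    # Share of all conversations contributed by the top_k most active users.
    user_counts = df["user_id"].value_counts()
    top_share = float(user_counts.head(top_k).sum() / len(df))

    # Conversation length in turns (each conversation is a list of messages).
    lengths = df["conversation"].apply(len)

    return {
        "na_model_share": na_share,
        "top_user_share": top_share,
        "median_turns": float(lengths.median()),
        "max_turns": int(lengths.max()),
    }


# Toy data standing in for df_ours (values are made up for illustration).
toy = pd.DataFrame({
    "model_name": ["", "gpt-4", "", "gpt-3.5", ""],
    "user_id": ["u1", "u1", "u1", "u2", "u3"],
    "conversation": [[1, 2], [1], [1, 2, 3, 4], [1, 2], [1, 2, 3]],
})
summary = summarize_findings(toy, top_k=1)
print(summary)
```

On the real data, calling `summarize_findings(df_ours)` would give the share of "N/A" rows and the concentration of contributions among the most active users as single numbers, which are easier to compare across dataset versions than the plots alone.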


### **Insights or Next Steps**

*   Further investigation into the "N/A" model category could reveal reasons for its prevalence and potential data collection or labeling issues.
*   Analyzing the distribution of languages and user contributions can help understand the diversity and activity levels within the dataset.
*   Manually filling in the empty metadata fields, by inferring them from the source dataset's name and public setup, could substantially improve the dataset overall.
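
The last suggestion could be sketched roughly as follows, assuming the `source` column identifies the originating dataset. The mapping `SOURCE_TO_MODEL`, its keys and values, the helper `fill_model_names`, and the `model_name_inferred` column are all hypothetical placeholders; a real mapping would have to be built by hand from each source dataset's documentation.

```python
import pandas as pd

# HYPOTHETICAL mapping from a `source` value to the model that produced it.
# These entries are placeholders, not real ShareLM source or model names.
SOURCE_TO_MODEL = {
    "example_source_a": "example-model-a",
    "example_source_b": "example-model-b",
}


def fill_model_names(df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Fill empty model names from `source`, flagging every inferred value."""
    out = df.copy()
    # A model name counts as missing if it is null or blank.
    name = out["model_name"].fillna("").astype(str).str.strip()
    missing = name == ""
    # Look up each row's source; unmapped sources become NaN.
    inferred = out["source"].map(mapping)
    fillable = missing & inferred.notna()
    out.loc[fillable, "model_name"] = inferred[fillable]
    # Keep provenance so inferred names are never confused with recorded ones.
    out["model_name_inferred"] = fillable
    return out


# Toy data for illustration only.
toy_meta = pd.DataFrame({
    "model_name": ["", "gpt-4", None, ""],
    "source": ["example_source_a", "example_source_a",
               "example_source_b", "unknown_source"],
})
filled = fill_model_names(toy_meta, SOURCE_TO_MODEL)
print(filled)
```

Recorded model names are left untouched, rows whose source is not in the mapping stay empty, and the flag column makes it possible to exclude inferred values from any analysis that needs only recorded metadata.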


# **References**


1.   Don-Yehiya S, Choshen L, Abend O. The ShareLM collection and plugin: contributing human-model chats for the benefit of the community. arXiv preprint arXiv:2408.08291. 2024 Aug 15.

2.   Meyer S, Elsweiler D. "You tell me": a dataset of GPT-4-based behaviour change support conversations. In Proceedings of the 2024 Conference on Human Information Interaction and Retrieval 2024 Mar 10 (pp. 411-416).

3.   Zhao W, Ren X, Hessel J, Cardie C, Choi Y, Deng Y. WildChat: 1M ChatGPT interaction logs in the wild. arXiv preprint arXiv:2405.01470. 2024 May 2.

4.   Hsu E, Yam HM, Bouissou I, John AM, Thota R, Koe J, Putta VS, Dharesan GK, Spangher A, Murty S, Huang T. WebDS: An End-to-End Benchmark for Web-based Data Science. arXiv preprint arXiv:2508.01222. 2025 Aug 2.




