## Load datasets
These cells display the first five and last five rows of each CSV in `data/raw` and print the total row count. To get the tail and the count without loading a large file into memory, each file is streamed in chunks and only the last five rows seen so far are kept.
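
The bounded-buffer idea behind the streaming tail can be sketched without pandas. The snippet below is only an illustration under stated assumptions: `tail_and_count` is a hypothetical helper, not part of the notebook, and the in-memory sample stands in for a large file. A `deque` with `maxlen=n` silently drops its oldest item whenever a new one is appended, so memory stays proportional to `n` rather than to the file size.

```python
import csv
import io
from collections import deque

def tail_and_count(lines, n=5):
    """Return (last_n_rows, total_rows) from an iterable of CSV lines (header skipped)."""
    reader = csv.reader(lines)
    next(reader, None)        # consume the header row, if any
    last = deque(maxlen=n)    # oldest rows fall off as new ones arrive
    total = 0
    for row in reader:
        last.append(row)
        total += 1
    return list(last), total

# Made-up in-memory data standing in for a large file
sample = io.StringIO("id,text\n" + "".join(f"{i},row {i}\n" for i in range(10)))
tail_rows, total_rows = tail_and_count(sample, n=3)
print(tail_rows)   # [['7', 'row 7'], ['8', 'row 8'], ['9', 'row 9']]
print(total_rows)  # 10
```

For a file on disk, the same function could be passed an open file handle (opened with `newline=""`, as the `csv` module recommends).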

In [4]:
import pandas as pd
from pathlib import Path
from collections import deque

def head_tail_count(path, n=5, chunksize=100000):
    """Return (head_df, tail_df, total_rows) for CSV at path."""
    # Head: read first n rows (fast)
    head = pd.read_csv(path, nrows=n)
    # Tail and count: stream the file in chunks and keep the last n rows using a deque
    dq = deque(maxlen=n)
    cols = None
    total = 0
    for chunk in pd.read_csv(path, chunksize=chunksize):
        if cols is None:
            cols = chunk.columns
        total += len(chunk)
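        # Only the last n rows of a chunk can reach the tail, so convert just those
        dq.extend(chunk.tail(n).values.tolist())
    # Rows are stored as plain lists, so the tail's index restarts at 0
    tail = pd.DataFrame(list(dq), columns=cols) if cols is not None else pd.DataFrame()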
    return head, tail, total

# Example usage for hc3.csv
path = Path(r"c:\Deep Learning Project\data\raw\hc3.csv")  # raw string: single backslashes
head, tail, total = head_tail_count(path, n=5)
print('--- hc3.csv: Head ---')
display(head)
print('--- hc3.csv: Tail ---')
display(tail)
print(f'Total rows: {total}')

--- hc3.csv: Head ---


Unnamed: 0,question,human_answers,chatgpt_answers,index,source
0,"Why is every book I hear about a "" NY Times # ...","['Basically there are many categories of "" Bes...",['There are many different best seller lists t...,,reddit_eli5
1,"If salt is so bad for cars , why do we use it ...",['salt is good for not dying in car crashes an...,"[""Salt is used on roads to help melt ice and s...",,reddit_eli5
2,Why do we still have SD TV channels when HD lo...,"[""The way it works is that old TV stations got...","[""There are a few reasons why we still have SD...",,reddit_eli5
3,Why has nobody assassinated Kim Jong - un He i...,"[""You ca n't just go around assassinating the ...",['It is generally not acceptable or ethical to...,,reddit_eli5
4,How was airplane technology able to advance so...,['Wanting to kill the shit out of Germans driv...,['After the Wright Brothers made the first pow...,,reddit_eli5


--- hc3.csv: Tail ---


Unnamed: 0,question,human_answers,chatgpt_answers,index,source
0,Is rise in pressure from 116/66 to 140/80 norm...,['Hello!Welcome and thank you for asking on HC...,"[""It's not uncommon for blood pressure to fluc...",,medicine
1,What could cause a painless lump in the right ...,"['Hi, * As per my surgical experience, the iss...",['There are several possible causes of a painl...,,medicine
2,Can Acutret be given to a child for treatment ...,['Although it is difficult to comment whether ...,"[""It is not appropriate for me to recommend a ...",,medicine
3,Are BP of 119/65 and pulse of 35 causes for co...,['Welcome and thank you for asking on HCM! I h...,['It is not uncommon for people with rheumatoi...,,medicine
4,Suggest treatment for back pain after walking ...,"['Hi,Having this type of back pain at this age...","['It is not uncommon to experience back pain, ...",,medicine


Total rows: 24322
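
The tail above is labelled 0 to 4 rather than with the rows' positions in the file, because the deque stores plain row lists and the DataFrame is rebuilt from them. If the original row labels matter, one possible variant keeps a running DataFrame tail instead. This is a sketch, not the notebook's method: `tail_keep_index` is a hypothetical name, and the demo data is made up.

```python
import io
import pandas as pd

def tail_keep_index(source, n=5, chunksize=100_000):
    """Hypothetical variant: return (tail_df, total_rows), keeping original row labels."""
    tail = None
    total = 0
    for chunk in pd.read_csv(source, chunksize=chunksize):
        total += len(chunk)
        # Only the running tail plus the current chunk can contain the final n rows
        tail = chunk.tail(n) if tail is None else pd.concat([tail, chunk]).tail(n)
    if tail is None:
        tail = pd.DataFrame()
    return tail, total

# Made-up in-memory data; a small chunksize forces several chunks
demo = io.StringIO("id,score\n" + "".join(f"{i},{i * 0.5}\n" for i in range(10)))
tail_df, total_rows = tail_keep_index(demo, n=3, chunksize=4)
print(tail_df.index.tolist())  # [7, 8, 9]
print(total_rows)              # 10
```

Because the rows never leave pandas, column dtypes are also carried over from the chunks rather than re-inferred from Python lists.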


In [5]:
# gpt_generated.csv (first 5, last 5, total rows)
path = Path(r"c:\Deep Learning Project\data\raw\gpt_generated.csv")  # raw string: single backslashes
head, tail, total = head_tail_count(path, n=5)
print('--- gpt_generated.csv: Head ---')
display(head)
print('--- gpt_generated.csv: Tail ---')
display(tail)
print(f'Total rows: {total}')

--- gpt_generated.csv: Head ---


Unnamed: 0,source,id,text
0,human,0,12 Years a Slave: An Analysis of the Film Essa...
1,human,1,20+ Social Media Post Ideas to Radically Simpl...
2,human,2,2022 Russian Invasion of Ukraine in Global Med...
3,human,3,533 U.S. 27 (2001) Kyllo v. United States: The...
4,human,4,A Charles Schwab Corporation Case Essay\n\nCha...


--- gpt_generated.csv: Tail ---


Unnamed: 0,source,id,text
0,ai,1418649,"Today, I accomplished a major feat. I stepped ..."
1,ai,1418650,As rockets rain down from the sky\nEurope trem...
2,ai,1418651,"On January 6th, 2023, the world lost a true pi..."
3,ai,1418652,A gene bank is a repository of genetic materia...
4,ai,1418655,"On the twelfth day of Christmas, My true love ..."


Total rows: 1392522


In [6]:
# kaggle_ai_human.csv (first 5, last 5, total rows)
path = Path(r"c:\Deep Learning Project\data\raw\kaggle_ai_human.csv")  # raw string: single backslashes
head, tail, total = head_tail_count(path, n=5)
print('--- kaggle_ai_human.csv: Head ---')
display(head)
print('--- kaggle_ai_human.csv: Tail ---')
display(tail)
print(f'Total rows: {total}')

--- kaggle_ai_human.csv: Head ---


Unnamed: 0,text,generated
0,Cars. Cars have been around since they became ...,0.0
1,Transportation is a large necessity in most co...,0.0
2,"""America's love affair with it's vehicles seem...",0.0
3,How often do you ride in a car? Do you drive a...,0.0
4,Cars are a wonderful thing. They are perhaps o...,0.0


--- kaggle_ai_human.csv: Tail ---


Unnamed: 0,text,generated
0,Tie Face on Mars is really just a big misunder...,0.0
1,The whole purpose of democracy is to create a ...,0.0
2,I firmly believe that governments worldwide sh...,1.0
3,I DFN't agree with this decision because a LFT...,0.0
4,"Richard Non, Jimmy Carter, and Bob Dole and ot...",0.0


Total rows: 487235


In [7]:
# routellm_gpt4_dataset.csv (first 5, last 5, total rows)
path = Path(r"c:\Deep Learning Project\data\raw\routellm_gpt4_dataset.csv")  # raw string: single backslashes
head, tail, total = head_tail_count(path, n=5)
print('--- routellm_gpt4_dataset.csv: Head ---')
display(head)
print('--- routellm_gpt4_dataset.csv: Tail ---')
display(tail)
print(f'Total rows: {total}')

--- routellm_gpt4_dataset.csv: Head ---


Unnamed: 0,prompt,source,gpt4_response,mixtral_response,mixtral_score
0,"I'll give you a review, can you extract the fo...",['lmsys-chat-1m'],"Sure, here's the analysis of the review:\n\n1....",Food aspects and opinion words:\n\n1. Made to ...,4
1,"Answer the following question: Claim: ""Joker m...",['flan_v2_cot'],The answer is no.\nChain of thoughts: Stan Lee...,The answer is no.\n\nChain of thoughts: While ...,5
2,TASK DEFINITION: In this task you will be give...,['flan_v2_niv2'],ZdoublexpropheciesS,"ZdoublexpropheciesS\n\nIn this task, you are a...",5
3,"Detailed Instructions: In this task, you need ...",['flan_v2_niv2'],Yes,"No, 'station' is not the longest word in the s...",5
4,A guy pick up a woman Then he puts her down Ex...,['sharegpt'],This phrase could be interpreted as a joke bec...,This joke is a play on words and relies on the...,5


--- routellm_gpt4_dataset.csv: Tail ---


Unnamed: 0,prompt,source,gpt4_response,mixtral_response,mixtral_score
0,Research shows that most black people have low...,['lmsys-chat-1m'],It's important to approach this topic with sen...,It's important to avoid making sweeping genera...,5
1,The script of the film is built around a doctr...,['lmsys-chat-1m'],Title: The Book of Purification\n\nChapter 1: ...,"In the name of the One True God, we welcome yo...",5
2,Tell me about revenue development of Microsoft...,['lmsys-chat-1m'],"I don't have real-time data, but I can provide...","Sure, I'd be happy to help with that! Microsof...",5
3,I have a youtube video embed. Is there a way t...,['sharegpt'],"Yes, there is a way to hide the channel avatar...","Yes, there is a way to customize the appearanc...",5
4,what are good shows for kids like sesame street?,['anthropic-hh'],There are many educational and entertaining sh...,"Sure, I'd be happy to help! Here are a few sho...",5


Total rows: 109101
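
The four cells above differ only in the file name, so the row counts could also be collected in one pass over the directory. The following is a minimal sketch under assumptions: `row_counts` is a hypothetical helper, the demo runs on a throwaway directory with made-up files rather than on `data/raw`, and the `with` form of `pd.read_csv(..., chunksize=...)` assumes pandas 1.2 or newer.

```python
import tempfile
from pathlib import Path

import pandas as pd

def row_counts(raw_dir, chunksize=100_000):
    """Hypothetical helper: map each CSV file name in raw_dir to its data-row count."""
    counts = {}
    for csv_path in sorted(Path(raw_dir).glob("*.csv")):
        # Stream each file in chunks so only one chunk is in memory at a time
        with pd.read_csv(csv_path, chunksize=chunksize) as reader:
            counts[csv_path.name] = sum(len(chunk) for chunk in reader)
    return counts

# Demo on a temporary directory with two tiny files (made-up data)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.csv").write_text("x\n1\n2\n3\n")
    Path(tmp, "b.csv").write_text("x\n1\n")
    demo_counts = row_counts(tmp)
print(demo_counts)  # {'a.csv': 3, 'b.csv': 1}
```

In the notebook, the same call pointed at the `data/raw` directory would be expected to report the totals printed above for each file.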
