## RandomDataSelect

**RandomDataSelect** is a lightweight Python utility for extracting a random subset of H-1B visa applications from the U.S. Department of Labor’s disclosure dataset. It handles both filtering and sampling in one pass, then writes the result to disk in CSV format.

### Features
- **Load** the FY2019 H-1B Disclosure dataset (`.xlsx`)
- **Filter** to keep only records where `VISA_CLASS == "H-1B"`
- **Randomly sample** a configurable fraction (default: 15%) of the filtered records
- **Export** the sampled subset as a CSV file, ready for downstream analysis

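The steps above can be sketched as one small function. This is a minimal sketch rather than the project's actual code: the helper name `sample_h1b` and its `frac` and `seed` parameters are hypothetical. It assumes the data is already in a DataFrame with a `VISA_CLASS` column, and the toy frame stands in for the real workbook.

```python
import pandas as pd


def sample_h1b(df: pd.DataFrame, frac: float = 0.15, seed: int = 42) -> pd.DataFrame:
    """Keep H-1B rows, then draw a reproducible random sample (hypothetical helper)."""
    if not 0 < frac <= 1:
        raise ValueError("frac must be in (0, 1]")
    h1b_only = df[df["VISA_CLASS"] == "H-1B"]
    return h1b_only.sample(frac=frac, random_state=seed)


# Toy stand-in for the real disclosure data: 100 H-1B rows plus 20 other rows
toy = pd.DataFrame({
    "CASE_NUMBER": [f"CASE-{i:03d}" for i in range(120)],
    "VISA_CLASS": ["H-1B"] * 100 + ["OTHER"] * 20,
})

subset = sample_h1b(toy, frac=0.15)
print(len(subset))                     # number of sampled rows
print(subset["VISA_CLASS"].unique())   # only H-1B rows survive the filter
```

Filtering before sampling means the fraction applies to the H-1B rows only, not to the full dataset.
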
### Requirements
- Python 3.7+
- pandas
- openpyxl

Install dependencies with:
```bash
pip install pandas openpyxl
```


In [1]:
```python
import pandas as pd

# Load the FY2019 disclosure workbook
file_path = 'H-1B_Disclosure_Data_FY2019.xlsx'
df = pd.read_excel(file_path, engine='openpyxl')

# Show initial structure and row count
print(f"Initial dataset shape: {df.shape}")
print("Columns:", df.columns.tolist())

# Keep only H-1B applications; adjust the column name if your file differs
df_filtered = df[df['VISA_CLASS'] == 'H-1B']
print(f"Filtered for H-1B applications: {df_filtered.shape[0]} rows remaining")

# Randomly sample a fraction of the filtered rows; the fixed seed makes the draw reproducible
sample_frac = 0.15
df_sample = df_filtered.sample(frac=sample_frac, random_state=42)
print(f"Sampled dataset ({sample_frac:.0%}): {df_sample.shape[0]} rows")

# Preview sampled data
df_sample.head(10)
```




Initial dataset shape: (664616, 260)
Columns: ['CASE_NUMBER', 'CASE_STATUS', 'CASE_SUBMITTED', 'DECISION_DATE', 'ORIGINAL_CERT_DATE', 'VISA_CLASS', 'JOB_TITLE', 'SOC_CODE', 'SOC_TITLE', 'FULL_TIME_POSITION', 'PERIOD_OF_EMPLOYMENT_START_DATE', 'PERIOD_OF_EMPLOYMENT_END_DATE', 'TOTAL_WORKER_POSITIONS', 'NEW_EMPLOYMENT', 'CONTINUED_EMPLOYMENT', 'CHANGE_PREVIOUS_EMPLOYMENT', 'NEW_CONCURRENT_EMPLOYMENT', 'CHANGE_EMPLOYER', 'AMENDED_PETITION', 'EMPLOYER_NAME', 'EMPLOYER_BUSINESS_DBA', 'EMPLOYER_ADDRESS1', 'EMPLOYER_ADDRESS2', 'EMPLOYER_CITY', 'EMPLOYER_STATE', 'EMPLOYER_POSTAL_CODE', 'EMPLOYER_COUNTRY', 'EMPLOYER_PROVINCE', 'EMPLOYER_PHONE', 'EMPLOYER_PHONE_EXT', 'NAICS_CODE', 'AGENT_REPRESENTING_EMPLOYER', 'AGENT_ATTORNEY_LAW_FIRM_BUSINESS_NAME', 'AGENT_ATTORNEY_ADDRESS1', 'AGENT_ATTORNEY_ADDRESS2', 'AGENT_ATTORNEY_CITY', 'AGENT_ATTORNEY_STATE', 'AGENT_ATTORNEY_POSTAL_CODE', 'AGENT_ATTORNEY_COUNTRY', 'AGENT_ATTORNEY_PROVINCE', 'AGENT_ATTORNEY_PHONE', 'AGENT_ATTORNEY_PHONE_EXT', 'STATE_OF_HI

| | CASE_NUMBER | CASE_STATUS | CASE_SUBMITTED | DECISION_DATE | ORIGINAL_CERT_DATE | VISA_CLASS | JOB_TITLE | SOC_CODE | SOC_TITLE | FULL_TIME_POSITION | ... | PW_OTHER_SOURCE_10 | PW_NON-OES_YEAR_10 | PW_SURVEY_PUBLISHER_10 | PW_SURVEY_NAME_10 | H-1B_DEPENDENT | WILLFUL_VIOLATOR | SUPPORT_H1B | STATUTORY_BASIS | MASTERS_EXEMPTION | PUBLIC_DISCLOSURE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 60320 | I-200-19137-737062 | CERTIFIED | 2019-05-22 16:20:53 | 2019-05-29 22:02:00 | NaT | H-1B | AUTOMATION ENGINEER | 17-2199 | ENGINEERS, ALL OTHER | Y | ... | | | | | N | N | | | | PLACE OF BUSINESS |
| 49766 | I-200-19079-369286 | CERTIFIED | 2019-03-20 05:31:12 | 2019-03-26 22:01:03 | NaT | H-1B | SOFTWARE ENGINEER | 15-1132 | SOFTWARE DEVELOPERS, APPLICATIONS | Y | ... | | | | | N | N | | | | PLACE OF BUSINESS |
| 156296 | I-200-19071-959843 | CERTIFIED | 2019-03-13 17:59:03 | 2019-03-19 22:04:57 | NaT | H-1B | SR. SOFTWARE ENGINEER | 15-1132 | SOFTWARE DEVELOPERS, APPLICATIONS | Y | ... | | | | | N | N | | | | PLACE OF BUSINESS |
| 266983 | I-200-19039-234107 | CERTIFIED | 2019-02-08 14:33:19 | 2019-02-14 22:01:01 | NaT | H-1B | SALESFORCE DEVELOPER | 15-1132 | SOFTWARE DEVELOPERS, APPLICATIONS | Y | ... | | | | | Y | N | Y | BOTH | | PLACE OF BUSINESS |
| 520742 | I-200-19071-830726 | CERTIFIED | 2019-03-13 12:41:08 | 2019-03-19 22:03:02 | NaT | H-1B | MANAGER, DIGITAL ANALYTICS | 15-2041 | STATISTICIANS | Y | ... | | | | | N | N | | | | PLACE OF BUSINESS |
| 305468 | I-200-19067-495656 | CERTIFIED | 2019-03-08 19:25:55 | 2019-03-14 22:04:02 | NaT | H-1B | COMPUTER TRAINING SPECIALIST | 13-1151 | TRAINING AND DEVELOPMENT SPECIALISTS | Y | ... | | | | | Y | N | N | | | PLACE OF BUSINESS |
| 479612 | I-200-19242-141172 | CERTIFIED | 2019-08-30 11:32:35 | 2019-09-06 22:01:00 | NaT | H-1B | IT PROJECT MANAGER 3 | 15-1199 | COMPUTER OCCUPATIONS, ALL OTHER | Y | ... | | | | | Y | N | Y | WAGE | | PLACE OF BUSINESS |
| 631189 | I-200-18312-002398 | CERTIFIED | 2018-11-09 12:24:54 | 2018-11-16 22:00:23 | NaT | H-1B | ANALYST | 15-1121 | COMPUTER SYSTEMS ANALYSTS | Y | ... | | | | | Y | N | Y | | | PLACE OF BUSINESS |
| 8993 | I-200-19071-532008 | CERTIFIED | 2019-03-12 23:25:59 | 2019-03-18 22:06:01 | NaT | H-1B | RF ENGINEER | 17-2071 | ELECTRICAL ENGINEERS | Y | ... | | | | | N | N | | | | PLACE OF BUSINESS |
| 84828 | I-200-19046-974283 | CERTIFIED | 2019-03-05 15:58:42 | 2019-03-11 22:03:36 | NaT | H-1B | SENIOR BUSINESS ANALYST | 15-1121 | COMPUTER SYSTEMS ANALYSTS | Y | ... | | | | | N | N | | | | PLACE OF BUSINESS |

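The leftmost values in the preview (60320, 49766, and so on) are the original row labels, which `sample` keeps. Because the cell passes `random_state=42`, re-running it on the same input should select the same rows. A small sketch on synthetic data (the toy frame is an assumption for illustration, not the real dataset) shows this:

```python
import pandas as pd

# Synthetic stand-in: 1000 rows that all pass the H-1B filter
toy = pd.DataFrame({"CASE_ID": range(1000), "VISA_CLASS": "H-1B"})

first = toy.sample(frac=0.15, random_state=42)
second = toy.sample(frac=0.15, random_state=42)
other_seed = toy.sample(frac=0.15, random_state=7)

print(first.index.equals(second.index))      # same seed selects the same rows
print(first.index.equals(other_seed.index))  # a different seed almost surely differs
print(len(first))                            # 15% of 1000 rows
```

Dropping `random_state` gives a fresh sample on every run, which may be preferable when reproducibility is not needed.
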

In [2]:
```python
# Write the sample to a new CSV file, omitting the DataFrame index
df_sample.to_csv('H1B_RandomSamplingData.csv', index=False)
```
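
With `index=False`, the original row labels are not written, so a file read back with `pd.read_csv` gets a fresh 0-based index. A quick round-trip check might look like the sketch below, which uses an in-memory buffer and made-up placeholder rows instead of the real output file:

```python
import io

import pandas as pd

# Placeholder rows with non-contiguous labels, like those left behind by sampling
sample = pd.DataFrame(
    {"CASE_ID": ["A-001", "B-002"], "VISA_CLASS": ["H-1B", "H-1B"]},
    index=[60320, 49766],
)

buffer = io.StringIO()
sample.to_csv(buffer, index=False)  # same option as the export step
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored.index.tolist())          # fresh 0-based labels
print(restored.shape == sample.shape)   # same number of rows and columns
```

If the original row positions matter downstream, writing with the default `index=True` keeps them as an extra first column.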