# College Student Dataset Creation

This notebook builds a synthetic dataset of 500 college students. Each record combines personal information, academic details, and enrollment information.

## Import Required Libraries

Import pandas for data manipulation and `random` for generating synthetic values. `numpy`, `datetime`, and `json` are also imported, although the cells shown here do not use them.

In [7]:
import pandas as pd
import numpy as np
import random
from datetime import datetime, timedelta
import json

## Define Student Data Structure

Define the fields and structure for student records. Each student will have:
- **Student ID**: Unique identifier (1001-1500)
- **Name**: Full name of the student, stored as separate `First_Name` and `Last_Name` fields
- **Age**: Age range (18-25 years)
- **Major**: Field of study
- **GPA**: Grade point average on a 4.0 scale (generated values fall between 2.0 and 4.0)
- **Enrollment Year**: Year of enrollment
- **Phone**: Contact phone number
- **Email**: College email address
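
The field list above can also be written down as a typed record. The following is a minimal sketch, not part of the original notebook: the `StudentRecord` class and its lowercase attribute names are hypothetical, and the notebook itself stores each record as a plain dictionary.

```python
from dataclasses import dataclass, asdict

@dataclass
class StudentRecord:
    """Hypothetical typed view of one student record."""
    student_id: int
    first_name: str
    last_name: str
    age: int
    major: str
    gpa: float
    enrollment_year: int
    phone: str   # kept as text so it is never treated as a number
    email: str

# Illustrative values only; they are not taken from the generated dataset
example = StudentRecord(1001, "Amit", "Singh", 20, "Physics", 3.4, 2023,
                        "9812345678", "amit.singh1001@college.edu")

# asdict() converts the record back to a dictionary, the shape pandas expects
record_dict = asdict(example)
print(sorted(record_dict))
```

Typing the fields this way makes the intended type of each column explicit, which helps when deciding how to read the data back from disk later.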

In [8]:
# Define lists for generating realistic student data
first_names = ["Amit", "Bhavna", "Chirag", "Deepak", "Esha", "Faisal", "Gina", "Harsh", 
               "Isha", "Jatin", "Kavya", "Laxman", "Meera", "Nikhil", "Olivia", "Priya",
               "Qasim", "Rajesh", "Sophia", "Tushar", "Uma", "Vikram", "Wanisha", "Xyza", "Yash"]

last_names = ["Singh", "Kumar", "Sharma", "Patel", "Khan", "Verma", "Chopra", "Gupta",
              "Malhotra", "Nair", "Rao", "Sinha", "Bhat", "Desai", "Iyer", "Menon"]

majors = ["Computer Science", "Electrical Engineering", "Mechanical Engineering", 
          "Civil Engineering", "Business Administration", "Finance", "Marketing",
          "Psychology", "Biology", "Chemistry", "Physics", "Mathematics"]

# Helper functions for generating email addresses and phone numbers
def generate_email(first_name, last_name, student_id):
    return f"{first_name.lower()}.{last_name.lower()}{student_id}@college.edu"

def generate_phone():
    return f"98{random.randint(10000000, 99999999)}"
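
As a quick check on the two helpers above, the sketch below repeats their definitions so it runs on its own and confirms the output formats. The regular expression and the sample name are assumptions added for illustration.

```python
import random
import re

def generate_email(first_name, last_name, student_id):
    return f"{first_name.lower()}.{last_name.lower()}{student_id}@college.edu"

def generate_phone():
    return f"98{random.randint(10000000, 99999999)}"

email = generate_email("Amit", "Singh", 1001)
phone = generate_phone()

print(email)  # amit.singh1001@college.edu

# The phone number is the prefix "98" followed by exactly eight random digits
phone_ok = re.fullmatch(r"98\d{8}", phone) is not None
print(phone_ok)  # True
```

Because the student ID is embedded in the address, emails stay unique even when two students share a name. Phone numbers carry no such guarantee and can repeat.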

## Generate Student Records

Create synthetic student records with realistic values. We'll generate 500 student records with randomly assigned attributes.

In [9]:
# Generate student records
students = []
num_students = 500

for i in range(num_students):
    student_id = 1001 + i
    first_name = random.choice(first_names)
    last_name = random.choice(last_names)
    age = random.randint(18, 25)
    major = random.choice(majors)
    gpa = round(random.uniform(2.0, 4.0), 2)
    enrollment_year = random.choice([2021, 2022, 2023, 2024])
    phone = generate_phone()
    email = generate_email(first_name, last_name, student_id)
    
    student = {
        'Student_ID': student_id,
        'First_Name': first_name,
        'Last_Name': last_name,
        'Age': age,
        'Major': major,
        'GPA': gpa,
        'Enrollment_Year': enrollment_year,
        'Phone': phone,
        'Email': email
    }
    students.append(student)

print(f"Generated {len(students)} student records successfully!")
print(f"Student ID range: {students[0]['Student_ID']} to {students[-1]['Student_ID']}")

Generated 500 student records successfully!
Student ID range: 1001 to 1500
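
The records above depend on the state of the `random` module, so each run of the notebook produces a different dataset. If reproducibility matters, one option is to draw from a seeded generator. The sketch below is an assumption rather than something the notebook does: `make_records` is a hypothetical helper, and the name and major lists are shortened for illustration.

```python
import random

first_names = ["Amit", "Bhavna", "Chirag"]  # shortened list for illustration
majors = ["Physics", "Biology"]             # shortened list for illustration

def make_records(seed, count=5):
    # A private generator leaves the global random state untouched
    rng = random.Random(seed)
    records = []
    for i in range(count):
        records.append({
            "Student_ID": 1001 + i,
            "First_Name": rng.choice(first_names),
            "Major": rng.choice(majors),
            "GPA": round(rng.uniform(2.0, 4.0), 2),
        })
    return records

run_a = make_records(seed=42)
run_b = make_records(seed=42)
print(run_a == run_b)  # True
```

Calling `random.seed(42)` once before the generation loop would have a similar effect on the notebook's existing code, at the cost of fixing the global state for every later cell.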


## Create DataFrame from Student Data

Convert the student records list into a pandas DataFrame for easier manipulation and analysis.

In [10]:
# Create a pandas DataFrame from the student records
df_students = pd.DataFrame(students)

# Display basic information about the DataFrame
print("Dataset Shape:", df_students.shape)
print("\nFirst few records:")
print(df_students.head(10))

print("\nDataFrame Info:")
df_students.info()  # info() prints its report directly and returns None

Dataset Shape: (500, 9)

First few records:
   Student_ID First_Name Last_Name  Age                   Major   GPA  \
0        1001     Chirag    Sharma   22        Computer Science  3.05   
1        1002     Sophia     Menon   20             Mathematics  3.95   
2        1003     Laxman     Kumar   18       Civil Engineering  3.94   
3        1004     Bhavna     Singh   21  Mechanical Engineering  3.53   
4        1005      Qasim      Khan   23  Electrical Engineering  3.70   
5        1006     Sophia    Chopra   25                 Biology  2.21   
6        1007     Rajesh     Kumar   20                 Finance  2.76   
7        1008     Sophia     Verma   25  Electrical Engineering  3.20   
8        1009     Vikram     Singh   20                 Finance  2.12   
9        1010     Olivia     Gupta   18                 Biology  2.37   

   Enrollment_Year       Phone                          Email  
0             2023  9871633791  chirag.sharma1001@college.edu  
1             2022  9830
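
To illustrate the conversion step on a small scale, the sketch below builds a DataFrame from two hypothetical records that use a subset of the notebook's column names. The categorical cast and the index on `Student_ID` are optional refinements suggested here, not steps the notebook performs.

```python
import pandas as pd

# Two hypothetical records; values are illustrative only
sample_records = [
    {"Student_ID": 1001, "First_Name": "Amit", "Major": "Physics", "GPA": 3.4},
    {"Student_ID": 1002, "First_Name": "Esha", "Major": "Biology", "GPA": 2.9},
]

df_sample = pd.DataFrame(sample_records)

# Dictionary keys become columns, in insertion order
print(list(df_sample.columns))

# Optional: store a repeated label such as Major as a categorical
df_sample["Major"] = df_sample["Major"].astype("category")

# Optional: index by Student_ID for direct lookups
df_indexed = df_sample.set_index("Student_ID")
print(df_indexed.loc[1002, "First_Name"])  # Esha
```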

## Save Dataset to File

Export the student dataset to CSV and JSON formats for future use and sharing.

In [11]:
# Save the dataset to CSV format
csv_filename = 'college_students.csv'
df_students.to_csv(csv_filename, index=False)
print(f"Dataset saved to {csv_filename}")

# Save the dataset to JSON format
json_filename = 'college_students.json'
df_students.to_json(json_filename, orient='records', indent=2)
print(f"Dataset saved to {json_filename}")

# Display file sizes
import os
csv_size = os.path.getsize(csv_filename) / 1024  # in KB
json_size = os.path.getsize(json_filename) / 1024  # in KB
print(f"\nFile sizes:")
print(f"CSV: {csv_size:.2f} KB")
print(f"JSON: {json_size:.2f} KB")

Dataset saved to college_students.csv
Dataset saved to college_students.json

File sizes:
CSV: 41.58 KB
JSON: 116.70 KB
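
One thing to watch when loading the CSV back: the phone numbers are digit-only strings, so `pd.read_csv` will infer an integer column unless told otherwise. The sketch below demonstrates this with a small stand-in frame and a temporary file. The file name `demo_students.csv` and the sample values are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

df_demo = pd.DataFrame({
    "Student_ID": [1001, 1002],
    "Phone": ["9812345678", "9887654321"],  # text in memory
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "demo_students.csv")
    df_demo.to_csv(csv_path, index=False)

    # Without a dtype hint, pandas infers Phone as an integer column
    inferred = pd.read_csv(csv_path)
    # A dtype hint keeps Phone as text, matching the in-memory frame
    restored = pd.read_csv(csv_path, dtype={"Phone": str})

print(pd.api.types.is_integer_dtype(inferred["Phone"]))        # True
print(restored["Phone"].tolist() == df_demo["Phone"].tolist())  # True
```

The same `dtype` hint would apply when reading `college_students.csv`.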


## Display and Validate Dataset

Summarize the dataset and run basic integrity checks: unique student IDs, missing values, and the observed GPA and age ranges.

In [12]:
# Display statistical summary of the dataset
print("Statistical Summary of the Dataset:")
print(df_students.describe())

print("\n" + "="*80)
print("Detailed Data Overview:")
print("="*80)

# Count students by major
print("\nStudents by Major:")
print(df_students['Major'].value_counts())

print("\nStudents by Enrollment Year:")
print(df_students['Enrollment_Year'].value_counts().sort_index())

print("\nAge Distribution:")
print(df_students['Age'].value_counts().sort_index())

# Data validation checks
print("\n" + "="*80)
print("Data Validation Checks:")
print("="*80)

print(f"Total number of students: {len(df_students)}")
print(f"Unique Student IDs: {df_students['Student_ID'].nunique()}")
print(f"Duplicate Student IDs: {df_students['Student_ID'].duplicated().sum()}")
print(f"Missing values in any column: {df_students.isnull().sum().sum()}")
print(f"GPA Range: {df_students['GPA'].min()} to {df_students['GPA'].max()}")
print(f"Age Range: {df_students['Age'].min()} to {df_students['Age'].max()}")

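# Report success only if the checks above actually hold
checks_ok = (
    df_students['Student_ID'].is_unique
    and df_students.isnull().sum().sum() == 0
    and df_students['GPA'].between(2.0, 4.0).all()
    and df_students['Age'].between(18, 25).all()
)
if checks_ok:
    print("\nAll validation checks passed! Dataset is complete and valid.")
else:
    print("\nValidation failed: review the values reported above.")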

Statistical Summary of the Dataset:
        Student_ID         Age         GPA  Enrollment_Year
count   500.000000  500.000000  500.000000       500.000000
mean   1250.500000   21.264000    2.985380      2022.536000
std     144.481833    2.356568    0.591475         1.125717
min    1001.000000   18.000000    2.000000      2021.000000
25%    1125.750000   19.000000    2.445000      2022.000000
50%    1250.500000   21.000000    2.975000      2023.000000
75%    1375.250000   23.000000    3.490000      2024.000000
max    1500.000000   25.000000    3.990000      2024.000000

Detailed Data Overview:

Students by Major:
Major
Mechanical Engineering     56
Biology                    56
Computer Science           51
Electrical Engineering     45
Finance                    44
Business Administration    44
Psychology                 37
Chemistry                  37
Physics                    36
Civil Engineering          35
Marketing                  30
Mathematics                29
Name: count, 
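
The validation checks in this section print values for a reader to inspect. A possible alternative is a function that returns the list of problems it finds. The sketch below is an assumption, not the notebook's method: `validate_students` is a hypothetical helper, and the bounds and phone pattern are taken from the generation code earlier in the notebook.

```python
import pandas as pd

def validate_students(df):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if df["Student_ID"].duplicated().any():
        problems.append("duplicate Student_ID values")
    if df.isnull().any().any():
        problems.append("missing values")
    if not df["GPA"].between(2.0, 4.0).all():
        problems.append("GPA outside 2.0-4.0")
    if not df["Age"].between(18, 25).all():
        problems.append("Age outside 18-25")
    if not df["Phone"].astype(str).str.fullmatch(r"98\d{8}").all():
        problems.append("Phone not in 98XXXXXXXX format")
    return problems

# Small illustrative frames; values are not taken from the generated dataset
good = pd.DataFrame({
    "Student_ID": [1001, 1002],
    "Age": [18, 25],
    "GPA": [2.0, 4.0],
    "Phone": ["9812345678", "9887654321"],
})
bad = good.assign(Student_ID=[1001, 1001], GPA=[1.5, 4.0])

print(validate_students(good))  # []
print(validate_students(bad))   # ['duplicate Student_ID values', 'GPA outside 2.0-4.0']
```

Returning the problems rather than printing them lets a later cell assert that the list is empty before the dataset is exported.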