<a href="https://colab.research.google.com/github/BreakoutMentors/Data-Science-and-Machine-Learning/blob/main/datasets/Student_Athletes_Synthetic_Datset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The code below was written by SJ. Kai and SJ developed the structure used to create a synthetic dataset relevant to middle and high school age students. The dataset is split into three component tables:


1.   students (this is the dimension / base table; each row is a student, and each student appears only once)
2.   student_academics (this is a fact table; each row is a student, their grade level, and a course they took. Students can appear more than once in this table since they can take multiple courses at each grade level)
3.   student_athletes (this is a fact table; each row is a student, their age, and sports data. Students can appear more than once in this table since they can be associated with multiple sports at each age)

This synthetic data is meant to be used to help students learn how to manipulate and visualize data in Python for data science and ML. The data was generated to ensure correlations between variables exist and students have opportunities to clean and expand on the data, and ultimately share insightful discoveries from analyzing it.
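As a rough illustration of how the three tables relate, the sketch below builds tiny stand-in frames that share the `id` key and joins them. The rows and the `*_demo` names are made up for illustration and are not taken from the generated files; only the column names follow the tables produced later in this notebook.

```python
import pandas as pd

# Tiny hand-made stand-ins for the three tables (illustrative values only).
students_demo = pd.DataFrame(
    {"id": ["0000", "0001"], "name": ["A", "B"], "grade": [9, 10]}
).set_index("id")

academics_demo = pd.DataFrame({
    "id": ["0000", "0001", "0001"],
    "grade level": [9, 9, 10],
    "course": ["Algebra 1", "Geometry", "Algebra 2"],
    "course score": [90, 85, 88],
})

athletes_demo = pd.DataFrame({
    "id": ["0001"],
    "sport": ["Soccer"],
    "years experience": [2],
})

# One row per course, with the student's attributes attached.
courses_with_students = academics_demo.merge(
    students_demo.reset_index(), on="id", how="left"
)

# Average course score per student, attached to the athletes table.
avg_score = (
    academics_demo.groupby("id")["course score"]
    .mean()
    .rename("avg score")
    .reset_index()
)
athletes_with_scores = athletes_demo.merge(avg_score, on="id", how="left")
```

If the tables are loaded back from the CSV files instead, passing `dtype={"id": str}` to `pd.read_csv` should keep the zero-padded ids as strings so the joins still line up.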


In [8]:
!pip install Faker==18.10.1



# Students Dataset

In [2]:
import pandas as pd
import numpy as np
from faker import Faker
import random

fake = Faker()

num_students = 5000
grade = np.random.randint(9, 13, size=num_students)  # Grade levels 9-12 (upper bound is exclusive)
age = (grade + 5.5 + np.random.random(size=num_students)).astype(int)  # Age is grade + 5 or grade + 6
gender = np.random.choice(['Male', 'Female'], size=num_students)  # Gender of the students
name = [fake.name_male() if g == 'Male' else fake.name_female() for g in gender]  # Generate names based on gender
student_ids = [str(i).zfill(len(str(num_students))) for i in range(0, num_students)]  # Zero-padded string ids, e.g. '0000'

students = pd.DataFrame({
    'name': name,
    'age': age,
    'grade': grade,
    'gender': gender,
    'id': student_ids
})
students = students.set_index('id')
students.to_csv("students.csv")

In [3]:
students

id,name,age,grade,gender
0000,Shawn King,16,11,Male
0001,Megan Irwin,15,9,Female
0002,Brian Mendoza,15,9,Male
0003,Charlotte Parrish,15,10,Female
0004,Ashley Duran,17,12,Female
...,...,...,...,...
4995,Jonathan Holmes,15,9,Male
4996,Tony Decker,17,12,Male
4997,Patricia Perez,15,9,Female
4998,Lori Kim,15,10,Female
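The age formula in the cell above can be sanity-checked in isolation. The sketch below repeats the same arithmetic on a smaller sample and confirms that every age is the grade plus 5 or 6. It uses a seeded `np.random.default_rng` generator for repeatability, which is a substitution: the notebook itself calls the unseeded `np.random` functions.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded here only so the check is repeatable

n = 1000
grade = rng.integers(9, 13, size=n)                   # grades 9..12 (13 is exclusive)
age = (grade + 5.5 + rng.random(size=n)).astype(int)  # same formula as the cell above

# Truncating grade + 5.5 + U, with U in [0, 1), leaves grade + 5 or grade + 6.
offset = age - grade
print(sorted(set(offset.tolist())))
```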


# Academic Performance

In [4]:
subjects = {
    'math': ['Algebra 1', 'Geometry', 'Algebra 2', 'Precalculus', 'Calculus'],
    'science': ['Biology', 'Chemistry', 'Physics', 'Anatomy', 'Environmental Science', 'Astronomy'],
    'social studies': ['World Geography', 'World History', 'American History','American Government', 'Economics'],
    'language arts': ['English 1', 'English 2', 'English 3', 'English 4'],
    'elective': ['Physical Education', 'Computer Science', 'Studio Art', 'Music', 'Cooking', 'Yearbook']
}

def freshman_schedule(student_id):
  name = students.loc[student_id, 'name']
  grade = 9
  math = subjects['math'][random.randint(0, 1)]
  math_grade = random.randint(70, 100)
  science = subjects['science'][0]
  science_grade = 70 + int(np.clip(36.5 * math_grade/100 * random.random(), 0, 30)) # correlate science and math
  social_studies = subjects['social studies'][0]
  social_studies_grade = random.randint(70, 100)
  language_arts = subjects['language arts'][0]
  language_arts_grade = 70 + int(np.clip(36.5 * social_studies_grade/100 * random.random(), 0, 30))  # correlate language arts and social studies
  elective = random.choice(subjects['elective'])
  elective_grade = random.randint(70, 100)
  return {
      'id': student_id,
      'name': name,
      'grade': grade,
      'math': (math, math_grade),
      'science': (science, science_grade),
      'social studies': (social_studies, social_studies_grade),
      'language arts': (language_arts, language_arts_grade),
      'elective': (elective, elective_grade)
  }

def sophomore_schedule(previous_schedule):
  student_id = previous_schedule['id']
  name = students.loc[student_id, 'name']
  grade = 10
  math_level = subjects['math'].index(previous_schedule['math'][0]) + 1
  math = subjects['math'][math_level]
  math_grade = np.clip(previous_schedule['math'][1] + random.randint(-5, 5), 70, 100)
  science = subjects['science'][1]
  science_grade = np.clip(previous_schedule['science'][1] + random.randint(-5, 5), 70, 100)
  social_studies = subjects['social studies'][1]
  social_studies_grade = np.clip(previous_schedule['social studies'][1] + random.randint(-5, 5), 70, 100)
  language_arts = subjects['language arts'][1]
  language_arts_grade = np.clip(previous_schedule['language arts'][1] + random.randint(-5, 5), 70, 100)
  elective = random.choice(subjects['elective'])
  elective_grade = np.clip(previous_schedule['elective'][1] + random.randint(-5, 5), 70, 100)
  return {
      'id': student_id,
      'name': name,
      'grade': grade,
      'math': (math, math_grade),
      'science': (science, science_grade),
      'social studies': (social_studies, social_studies_grade),
      'language arts': (language_arts, language_arts_grade),
      'elective': (elective, elective_grade)
  }

def junior_schedule(previous_schedule):
  student_id = previous_schedule['id']
  name = students.loc[student_id, 'name']
  grade = 11
  math_level = subjects['math'].index(previous_schedule['math'][0]) + 1
  math = subjects['math'][math_level]
  math_grade = np.clip(previous_schedule['math'][1] + random.randint(-5, 5), 70, 100)
  science = subjects['science'][2]
  science_grade = np.clip(previous_schedule['science'][1] + random.randint(-5, 5), 70, 100)
  social_studies = subjects['social studies'][2]
  social_studies_grade = np.clip(previous_schedule['social studies'][1] + random.randint(-5, 5), 70, 100)
  language_arts = subjects['language arts'][2]
  language_arts_grade = np.clip(previous_schedule['language arts'][1] + random.randint(-5, 5), 70, 100)
  elective = random.choice(subjects['elective'])
  elective_grade = np.clip(previous_schedule['elective'][1] + random.randint(-5, 5), 70, 100)
  return {
      'id': student_id,
      'name': name,
      'grade': grade,
      'math': (math, math_grade),
      'science': (science, science_grade),
      'social studies': (social_studies, social_studies_grade),
      'language arts': (language_arts, language_arts_grade),
      'elective': (elective, elective_grade)
  }

def senior_schedule(previous_schedule):
  student_id = previous_schedule['id']
  name = students.loc[student_id, 'name']
  grade = 12
  math_level = subjects['math'].index(previous_schedule['math'][0]) + 1
  math = subjects['math'][math_level]
  math_grade = np.clip(previous_schedule['math'][1] + random.randint(-5, 5), 70, 100)
  science = random.choice(subjects['science'][-3:]) # pick one of the last three sciences at random
  science_grade = np.clip(previous_schedule['science'][1] + random.randint(-5, 5), 70, 100)
  social_studies = random.choice(subjects['social studies'][-2:]) # pick one of the last two social studies courses at random
  social_studies_grade = np.clip(previous_schedule['social studies'][1] + random.randint(-5, 5), 70, 100)
  language_arts = subjects['language arts'][3]
  language_arts_grade = np.clip(previous_schedule['language arts'][1] + random.randint(-5, 5), 70, 100)
  elective = random.choice(subjects['elective'])
  elective_grade = np.clip(previous_schedule['elective'][1] + random.randint(-5, 5), 70, 100)
  return {
      'id': student_id,
      'name': name,
      'grade': grade,
      'math': (math, math_grade),
      'science': (science, science_grade),
      'social studies': (social_studies, social_studies_grade),
      'language arts': (language_arts, language_arts_grade),
      'elective': (elective, elective_grade)
  }

def generate_student(student_id):
  grade_level = students.loc[student_id, 'grade']

  student_grades = [freshman_schedule(student_id)]

  if grade_level > 9:
    student_grades.append(sophomore_schedule(student_grades[-1]))
  if grade_level > 10:
    student_grades.append(junior_schedule(student_grades[-1]))
  if grade_level > 11:
    student_grades.append(senior_schedule(student_grades[-1]))

  return student_grades

def generate_many_student(student_id_list):
  student_grades = []
  for student_id in student_id_list:
    student_grades.extend(generate_student(student_id))
  return student_grades

def format_for_kai():
  student_grades = pd.DataFrame(generate_many_student(list(students.index)))
  course_list = []
  for index, row in student_grades.iterrows():
    math = {'id': row['id'], 'name': row['name'], 'grade level': row['grade'], 'course': row['math'][0], 'course score': row['math'][1]}
    science = {'id': row['id'], 'name': row['name'], 'grade level': row['grade'], 'course': row['science'][0], 'course score': row['science'][1]}
    social_studies = {'id': row['id'], 'name': row['name'], 'grade level': row['grade'], 'course': row['social studies'][0], 'course score': row['social studies'][1]}
    language_arts = {'id': row['id'], 'name': row['name'], 'grade level': row['grade'], 'course': row['language arts'][0], 'course score': row['language arts'][1]}
    elective = {'id': row['id'], 'name': row['name'], 'grade level': row['grade'], 'course': row['elective'][0], 'course score': row['elective'][1]}
    course_list.extend([math, science, social_studies, language_arts, elective])
  return pd.DataFrame(course_list)

academics = format_for_kai()
academics.to_csv("student_academics.csv")

In [5]:
academics

,id,name,grade level,course,course score
0,0000,Shawn King,9,Algebra 1,96
1,0000,Shawn King,9,Biology,97
2,0000,Shawn King,9,World Geography,97
3,0000,Shawn King,9,English 1,100
4,0000,Shawn King,9,Yearbook,97
...,...,...,...,...,...
62825,4999,Jennifer Jackson,9,Algebra 1,89
62826,4999,Jennifer Jackson,9,Biology,87
62827,4999,Jennifer Jackson,9,World Geography,76
62828,4999,Jennifer Jackson,9,English 1,79
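To see whether the math and science scores are in fact related, the freshman formula can be replayed on its own. The sketch below copies the `math_grade` and `science_grade` lines from `freshman_schedule`, draws 5,000 pairs with a fixed seed (the seed is an addition for repeatability), and measures the Pearson correlation. Because the science score also depends on an independent random factor, the correlation is expected to be positive but probably modest.

```python
import random
import numpy as np

random.seed(0)  # fixed seed, added only so the check is repeatable

math_grades, science_grades = [], []
for _ in range(5000):
    math_grade = random.randint(70, 100)
    # Same expression as in freshman_schedule above.
    science_grade = 70 + int(np.clip(36.5 * math_grade / 100 * random.random(), 0, 30))
    math_grades.append(math_grade)
    science_grades.append(science_grade)

# Pearson correlation between the two score lists.
corr = np.corrcoef(math_grades, science_grades)[0, 1]
print(f"math vs. science correlation: {corr:.2f}")
```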


## Want fewer courses?
The following draws a random sample of 1,000 rows from the academics table. Each run returns a different sample.

In [None]:
academics.sample(1000)

,id,name,grade level,course,course score
22731,1787,Robert Nichols,11,Physics,81
21161,1666,Mr. Jason Lee,11,Physics,99
61315,4861,Christine Holmes,10,Geometry,89
30292,2396,Ashley Griffin,9,World Geography,84
54370,4320,Kathleen Wiggins,9,Algebra 1,94
...,...,...,...,...,...
54897,4357,Monica Camacho,11,American History,76
8677,0681,Wendy Little,11,American History,82
39073,3093,Peter Love,10,English 2,90
57418,4554,Elizabeth Garcia,9,English 1,89
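Two variations on the sampling step may be useful; both are sketched below on a small made-up frame (`academics_demo` is a stand-in, not the real table). Passing `random_state` makes the draw repeatable, and sampling ids first keeps every course row for the chosen students instead of scattering rows across students.

```python
import pandas as pd

# Made-up stand-in for the academics table (three students, three courses each).
academics_demo = pd.DataFrame({
    "id": ["0000"] * 3 + ["0001"] * 3 + ["0002"] * 3,
    "course": ["Algebra 1", "Biology", "English 1"] * 3,
    "course score": [96, 97, 100, 89, 87, 79, 75, 80, 91],
})

# Row-level sample; random_state makes the draw repeatable.
row_sample = academics_demo.sample(4, random_state=0)

# Student-level sample: pick ids first, then keep every row for those ids.
chosen_ids = academics_demo["id"].drop_duplicates().sample(2, random_state=0)
student_sample = academics_demo[academics_demo["id"].isin(chosen_ids)]
```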


# Athletic Performance Dataset

In [None]:
correlation_strength_age = 0.3
correlation_strength_exp = 0.2

sports = ['Basketball', 'Soccer', 'Swimming', 'Track and Field']

def generate_first_year(student_id, sport, grade_level):
  age = students.loc[student_id, 'age'] - (students.loc[student_id, 'grade'] - grade_level)
  years_experience = 1
  if grade_level == 9:
    years_experience = random.randint(1, 4)
  hours_training = random.randint(3, 10) + correlation_strength_age * age + correlation_strength_exp * years_experience
  ranking = years_experience * hours_training * (random.random()*0.2 + 1.1)

  return {
      'id': student_id,
      'age': age,
      'sport': sport,
      'years experience': years_experience,
      'hours training': hours_training,
      'ranking': ranking
  }

def generate_next_year(previous_year):
  student_id = previous_year['id']
  age = previous_year['age'] + 1
  sport = previous_year['sport']
  years_experience = previous_year['years experience'] + 1
  hours_training = previous_year['hours training'] * (random.random()*.3 + 1.1)
  ranking = years_experience * hours_training * (random.random()*0.2 + 1.1)

  return {
      'id': student_id,
      'age': age,
      'sport': sport,
      'years experience': years_experience,
      'hours training': hours_training,
      'ranking': ranking
  }

def generate_athlete(student_id):
  grade_level = students.loc[student_id, 'grade']
  num_sports = random.randint(0, 3)
  sports_played = random.sample(sports, num_sports)

  athlete_years = []
  for sport in sports_played:
    first_year = random.randint(9, grade_level)
    athlete_years.append(generate_first_year(student_id, sport, first_year))
    for i in range(first_year, grade_level):
      athlete_years.append(generate_next_year(athlete_years[-1]))

  return athlete_years

def generate_many_athletes(student_id_list):
  student_athletes = []
  for student_id in student_id_list:
    student_athletes.extend(generate_athlete(student_id))

  return pd.DataFrame(student_athletes)

student_athletes = generate_many_athletes(list(students.index))
for sport in sports:
  in_sport = student_athletes['sport'] == sport
  raw_ranking = student_athletes.loc[in_sport, 'ranking']
  # Replace the raw score with its within-sport percentile (0 = lowest raw score)
  student_athletes.loc[in_sport, 'ranking'] = raw_ranking.argsort().argsort() / raw_ranking.size
student_athletes['ranking'] = student_athletes['ranking'].round(2)
student_athletes['hours training'] = student_athletes['hours training'].round(2)

student_athletes.to_csv("student_athletes.csv")

In [None]:
student_athletes

,id,age,sport,years experience,hours training,ranking
0,0002,15,Swimming,3,9.10,0.50
1,0002,15,Track and Field,2,14.90,0.57
2,0002,15,Soccer,4,12.30,0.73
3,0003,14,Soccer,2,13.60,0.47
4,0003,14,Basketball,1,8.40,0.01
...,...,...,...,...,...,...
13118,4996,15,Track and Field,1,14.70,0.29
13119,4996,15,Soccer,1,14.70,0.29
13120,4997,16,Track and Field,1,15.00,0.33
13121,4998,15,Soccer,3,14.10,0.73
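The final ranking step above converts each raw score into a within-sport percentile. A minimal sketch of the same idea on a made-up frame follows. It uses `groupby(...).transform` instead of the explicit loop over sports, which is an alternative formulation rather than the notebook's own code, and `raw score` and `percentile_within_group` are hypothetical names.

```python
import pandas as pd

# Made-up raw scores for two sports (illustrative values only).
athletes_demo = pd.DataFrame({
    "sport": ["Soccer", "Soccer", "Soccer", "Soccer", "Swimming", "Swimming"],
    "raw score": [12.0, 30.5, 7.25, 18.0, 40.0, 22.0],
})

def percentile_within_group(scores):
    # argsort().argsort() turns values into 0-based ranks (0 = smallest value).
    ranks = scores.to_numpy().argsort().argsort()
    # Dividing by the group size scales the ranks into the range [0, 1).
    return pd.Series(ranks / scores.size, index=scores.index)

athletes_demo["ranking"] = (
    athletes_demo.groupby("sport")["raw score"].transform(percentile_within_group)
)
print(athletes_demo)
```

With this scaling, the lowest raw score in each sport gets 0 and the highest gets just under 1, so a larger `ranking` value means a stronger athlete within that sport.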


# Example Prompt Ideas:
- What kind of relationship do you expect between years of experience and age? (Single-variable linear)
- The popularity of a student seems to be based on how talented they are. In this case, talent isn't determined by a single variable but by three: accuracy, speed, and vertical jump.
  - Create a new metric called talent that combines them: rank each student on each metric (use .argsort().argsort()), add the three ranks, and plot the sum against popularity.
  - Linear relationship with 3 inputs
- Introduce the concept of percentiles with the accuracy metric (.argsort().argsort())
- Accuracy increases with hours of training at a non-linear rate (square root).
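The talent and percentile prompts can be sketched as follows. The `accuracy`, `speed`, and `vertical jump` columns are hypothetical: they are not part of the tables generated in this notebook, and the values are made up purely to show how `.argsort().argsort()` produces ranks.

```python
import pandas as pd

# Hypothetical metrics; these columns are not in the generated tables.
metrics = pd.DataFrame({
    "accuracy": [0.62, 0.91, 0.75, 0.40],
    "speed": [6.1, 7.4, 8.0, 5.2],
    "vertical jump": [18.0, 25.0, 22.0, 20.0],
})

# Rank each metric separately (0 = lowest value), then add the ranks.
ranks = metrics.apply(lambda col: col.argsort().argsort())
metrics["talent"] = ranks.sum(axis=1)

# Dividing a rank by the number of rows gives a percentile-style score.
metrics["accuracy percentile"] = ranks["accuracy"] / len(ranks)
print(metrics)
```

Summing ranks rather than raw values keeps any one metric from dominating just because it is measured on a larger scale.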