## Exercise

Exploring Students' Performance Dataset - Numerical Techniques

**Objective**: The objective of this exercise is to practice using **numerical techniques** to analyze the Students' Performance dataset and gain insights into the students' academic performance.

**Dataset Description**:
The Students' Performance dataset contains information about students' demographic attributes, such as gender, race/ethnicity, parental education, lunch type, and test scores in three subjects: Math, Reading, and Writing.

**Exercise Steps**:

- Load the Dataset: Import the necessary libraries and load the Students' Performance dataset into a pandas DataFrame.

- Explore the Dataset: Use basic pandas functions to get an overview of the dataset, including the number of rows and columns and the number of unique values in each column. For columns with fewer than 10 distinct values, show those unique values. *Hint: the previous lesson covers the functions and methods you need.*

- Analyze Descriptive Statistics: Calculate and interpret descriptive statistics, including measures of central tendency (mean, median, mode) and dispersion (standard deviation, range) for the numerical variables, and frequency counts for categorical variables.

### Loading the Dataset

In [1]:
# Dataset source URL
url = "https://raw.githubusercontent.com/data-bootcamp-v4/prework_data/main/students_performance.csv"

In [2]:
import pandas as pd
sp = pd.read_csv(url)

### Exploring Dataset

In [3]:
sp.head()

|   | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |


In [5]:
# Number of rows and columns
sp.shape

(1000, 8)

In [6]:
# Number of unique values for each column
sp.nunique()

gender                          2
race/ethnicity                  5
parental level of education     6
lunch                           2
test preparation course         2
math score                     81
reading score                  72
writing score                  77
dtype: int64

In [23]:
# Unique values of the columns with fewer than 10 distinct values
columnsless10 = ["gender", "race/ethnicity", "parental level of education", "lunch", "test preparation course"]

for column in columnsless10:
    print(column)
    print(sp[column].unique())
    print("")

gender
['female' 'male']

race/ethnicity
['group B' 'group C' 'group A' 'group D' 'group E']

parental level of education
["bachelor's degree" 'some college' "master's degree" "associate's degree"
 'high school' 'some high school']

lunch
['standard' 'free/reduced']

test preparation course
['none' 'completed']
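The list of columns above was typed by hand from the `nunique()` output. As a minimal sketch, the same selection could be derived programmatically. The `toy` DataFrame below is a hypothetical miniature stand-in for `sp`, used so that the example runs without downloading anything; the threshold of 10 comes from the exercise statement.

```python
import pandas as pd

# Hypothetical miniature stand-in for the `sp` DataFrame (not the real data)
toy = pd.DataFrame({
    "gender": ["female", "male"] * 6,
    "lunch": ["standard", "standard", "free/reduced"] * 4,
    "math score": [47, 52, 58, 61, 65, 69, 72, 76, 81, 88, 90, 95],
})

# Count distinct values per column, then keep the columns below the threshold
unique_counts = toy.nunique()
low_cardinality = unique_counts[unique_counts < 10].index.tolist()

# Show the distinct values of each selected column
for column in low_cardinality:
    print(column)
    print(toy[column].unique())
    print("")
```

With the real dataset, replacing `toy` with `sp` should select the same five columns listed in `columnsless10`, given the `nunique()` output shown earlier.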



### Analyzing Descriptive Statistics
#### Numerical variables
##### Measures of central tendency (mean, median, mode)

In [22]:
numvars = ["math score", "reading score", "writing score"]

print("Mean:")
print(sp[numvars].mean())
print("")
print("Median:")
print(sp[numvars].median())
print("")
print("Mode:")
print(sp[numvars].mode())

Mean:
math score       66.089
reading score    69.169
writing score    68.054
dtype: float64

Median:
math score       66.0
reading score    70.0
writing score    69.0
dtype: float64

Mode:
   math score  reading score  writing score
0          65             72             74
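Unlike `mean()` and `median()`, `mode()` returns a DataFrame rather than a Series, because a column can have more than one mode. Here each subject has a single mode, so there is only one row. The sketch below uses a small hypothetical `scores` DataFrame (not the real data) to show what the result looks like when one column has a tie.

```python
import pandas as pd

# Hypothetical scores: "math score" has two modes (65 and 70),
# "reading score" has a single mode (72)
scores = pd.DataFrame({
    "math score": [65, 65, 70, 70, 80],
    "reading score": [72, 72, 72, 60, 90],
})

modes = scores.mode()
print(modes)

# One row per tied mode; columns with fewer modes are padded with NaN
print(len(modes))
```

When reporting a single mode per column, it is worth checking the number of rows first, since taking only row 0 would silently hide any ties.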


##### Measures of dispersion (standard deviation, range)

In [24]:
print("Standard deviation:")
print(sp[numvars].std())
print("")
print("Range:")
print(sp[numvars].max() - sp[numvars].min())

Standard deviation:
math score       15.163080
reading score    14.600192
writing score    15.195657
dtype: float64

Range:
math score       100
reading score     83
writing score     90
dtype: int64
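Two details are worth keeping in mind when reading these numbers. By default, pandas' `std()` computes the sample standard deviation (`ddof=1`), and the range depends only on the two most extreme values. The sketch below, on a small hypothetical `scores` DataFrame, compares the sample and population standard deviations and adds the interquartile range, which is not part of the exercise but is a common, less outlier-sensitive complement to the range.

```python
import pandas as pd

# Hypothetical scores, chosen so the results are easy to verify by hand
scores = pd.DataFrame({
    "math score": [40, 50, 60, 70, 80],
    "reading score": [55, 60, 65, 70, 75],
})

sample_std = scores.std()            # pandas default: ddof=1 (divide by n - 1)
population_std = scores.std(ddof=0)  # divide by n instead
value_range = scores.max() - scores.min()

# Interquartile range: spread of the middle 50% of the values
iqr = scores.quantile(0.75) - scores.quantile(0.25)

print("Sample std:", sample_std, sep="\n")
print("Population std:", population_std, sep="\n")
print("Range:", value_range, sep="\n")
print("IQR:", iqr, sep="\n")
```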


##### Summary of numerical variables

In [27]:
sp[numvars].describe()

|   | math score | reading score | writing score |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 66.089 | 69.169 | 68.054 |
| std | 15.16308 | 14.600192 | 15.195657 |
| min | 0.0 | 17.0 | 10.0 |
| 25% | 57.0 | 59.0 | 57.75 |
| 50% | 66.0 | 70.0 | 69.0 |
| 75% | 77.0 | 79.0 | 79.0 |
| max | 100.0 | 100.0 | 100.0 |


##### Numerical variables interpretation
Assuming that scores in every subject range from 0 (worst) to 100 (best) and that the minimum passing score is 50:
* The average score in each subject falls between 66 and 70.
    * The mean and median are almost identical in every subject, which suggests the score distributions are roughly symmetric, with no large group of students at either extreme.
* Overall results are better in Reading and Writing than in Math.
    * Reading and Writing have higher means, medians, and modes than Math, as well as smaller ranges.
    * The summary also shows minimum scores of 17 in Reading and 10 in Writing, whereas at least one student scored 0 in Math.
* The standard deviation is similar across subjects. Reading has a slightly smaller one (14.6, versus about 15.2 for Math and Writing), suggesting that Reading grades are somewhat more tightly clustered. This is consistent with Reading also having the smallest range.
* In each subject, at least 75% of students pass: the 25th percentile (57, 59, and 57.75) is above 50 in all three.
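The quartiles only give a lower bound on the pass rate. As a rough sketch, the exact share of passing students could be computed directly by comparing scores against the threshold. The passing score of 50 is the assumption stated above, and the `scores` DataFrame is hypothetical toy data, not the real dataset.

```python
import pandas as pd

PASSING_SCORE = 50  # assumption stated in the interpretation, not part of the dataset

# Hypothetical toy scores for four students
scores = pd.DataFrame({
    "math score": [30, 55, 70, 90],
    "reading score": [60, 45, 80, 95],
    "writing score": [50, 65, 75, 85],
})

# Boolean DataFrame: True where the student reached the passing score
passed = scores >= PASSING_SCORE

# The mean of a boolean column is the fraction of True values
pass_rate_per_subject = passed.mean()

# Fraction of students who passed every subject at once
pass_rate_all_subjects = passed.all(axis=1).mean()

print(pass_rate_per_subject)
print(pass_rate_all_subjects)
```

In this toy data each subject has a pass rate of at least 75%, yet only half of the students pass all three subjects at once, which is why the per-subject and all-subjects figures should be reported separately.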

#### Categorical variables
##### Frequency counts

In [31]:
# In this dataset, the categorical variables are the columns with fewer than 10 distinct values
sp[columnsless10]

|   | gender | race/ethnicity | parental level of education | lunch | test preparation course |
|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none |
| 1 | female | group C | some college | standard | completed |
| 2 | female | group B | master's degree | standard | none |
| 3 | male | group A | associate's degree | free/reduced | none |
| 4 | male | group C | some college | standard | none |
| ... | ... | ... | ... | ... | ... |
| 995 | female | group E | master's degree | standard | completed |
| 996 | male | group C | high school | free/reduced | none |
| 997 | female | group C | high school | free/reduced | completed |
| 998 | female | group D | some college | standard | completed |


In [46]:
for column in columnsless10:
    print(sp[column].value_counts())
    print("")

gender
female    518
male      482
Name: count, dtype: int64

race/ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
master's degree        59
Name: count, dtype: int64

lunch
standard        645
free/reduced    355
Name: count, dtype: int64

test preparation course
none         642
completed    358
Name: count, dtype: int64



##### Categorical variables interpretation
* Gender: out of 1000 students, 518 are female and 482 are male.
* Race / Ethnicity: the groups differ considerably in size. Most students belong to groups C and D (581 together), followed by groups B and E (330 together), while group A is a small minority (only 89 students).
* Parental level of education: most parents' education falls between high school and an associate's degree (644 students across "high school", "some college", and "associate's degree"). The parents of 179 students have only "some high school". A minority of students' parents, 177, hold a bachelor's degree or higher, and only 59 of those hold a master's.
* Lunch: almost two thirds of the students get a standard lunch (645), while the remaining third gets a free/reduced lunch (355).
* Test preparation course: the proportions are very similar to those of the lunch variable. About two thirds of the students have not taken the test preparation course (642), while the remaining third completed it (358). Similar proportions alone do not show that the two variables are related; if further study did find an association, it could signal social inequalities.
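Whether lunch type and test preparation are actually related cannot be read from the two separate frequency tables; it requires counting the combinations. One possible starting point is a cross-tabulation with `pd.crosstab`, sketched below on a hypothetical `toy` DataFrame. The values are invented for illustration and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from the real dataset
toy = pd.DataFrame({
    "lunch": ["standard", "standard", "standard",
              "free/reduced", "free/reduced", "standard"],
    "test preparation course": ["none", "completed", "none",
                                "none", "none", "completed"],
})

# Row-normalized cross-tabulation: within each lunch type,
# the share of students in each test preparation category
table = pd.crosstab(
    toy["lunch"],
    toy["test preparation course"],
    normalize="index",
)
print(table)
```

If the rows of such a table look alike on the real data, the two variables are probably unrelated; clearly different rows would justify the further study mentioned above.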