# Pandas Student Notebook — Easy Foundations  
## Dataset: Kaggle “Students Performance in Exams”

### Goal of this notebook
This notebook is meant as a **gentle introduction** to Pandas for beginners.
The focus is on:
- reading data
- selecting columns
- simple filtering
- basic groupby
- creating very simple new columns

There are **no tricky edge cases** in this notebook.
Work slowly and check your results often.


## 0. Setup

Load `StudentsPerformance.csv` into a DataFrame called `df`.

Then show:
- the number of rows and columns
- the first 5 rows
- the data types, memory usage, total row count, and number of missing values per column


In [1]:
import pandas as pd

df = pd.read_csv('data/StudentsPerformance.csv')

In [2]:
df.shape

(1000, 8)

In [3]:
df.head()

| | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB
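
`df.info()` reports non-null counts, so missing values have to be inferred by comparing each count with the row total. A more direct check is `isna().sum()`. The sketch below uses a small hypothetical frame (not the real dataset) with one missing value so the result is visible:

```python
import pandas as pd

# Hypothetical mini-frame, not the real dataset: one math score is missing.
demo = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "math score": [72, float("nan"), 90],
})

print(demo.shape)                       # (rows, columns)

# isna() marks each missing cell as True; sum() counts them per column.
missing_per_column = demo.isna().sum()
print(missing_per_column)
```

On the real `df`, every count should come out as zero, since `df.info()` shows 1000 non-null values in each column.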


## 1. Looking at the data

1) Print the column names.  
2) Use `df.describe()` to see summary statistics.  

Question (write as a comment):
- What kind of values do the score columns contain?


In [5]:
df.columns

Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course', 'math score', 'reading score',
       'writing score'],
      dtype='object')

In [6]:
df.describe()
# The score columns contain integers (int64) between 0 and 100.

| | math score | reading score | writing score |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 66.089 | 69.169 | 68.054 |
| std | 15.16308 | 14.600192 | 15.195657 |
| min | 0.0 | 17.0 | 10.0 |
| 25% | 57.0 | 59.0 | 57.75 |
| 50% | 66.0 | 70.0 | 69.0 |
| 75% | 77.0 | 79.0 | 79.0 |
| max | 100.0 | 100.0 | 100.0 |
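
One way to answer the question about the score columns is to look at their dtypes and their minimum and maximum directly. This is a minimal sketch that rebuilds the five rows shown by `df.head()` instead of reading the CSV. The names `sample` and `score_range` are made up for illustration:

```python
import pandas as pd

# Score columns of the first five rows, copied from the df.head() output.
sample = pd.DataFrame({
    "math score": [72, 69, 90, 47, 76],
    "reading score": [72, 90, 95, 57, 78],
    "writing score": [74, 88, 93, 44, 75],
})

print(sample.dtypes)                      # integer dtype for every column

# agg with a list of function names returns one row per function.
score_range = sample.agg(["min", "max"])
print(score_range)
```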


## 2. Selecting columns

Create a new DataFrame `scores` that contains only:
- `math score`
- `reading score`
- `writing score`

Show the first 5 rows.

Note: Continue the rest of the notebook with `df`, not `scores`.


In [7]:
scores = df[['math score', 'reading score', 'writing score']].copy()
scores.head()

| | math score | reading score | writing score |
|---|---|---|---|
| 0 | 72 | 72 | 74 |
| 1 | 69 | 90 | 88 |
| 2 | 90 | 95 | 93 |
| 3 | 47 | 57 | 44 |
| 4 | 76 | 78 | 75 |
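
Two details of this selection are easy to miss: double brackets return a DataFrame while single brackets return a Series, and `.copy()` makes the new frame independent of the original. The sketch below illustrates both on three rows taken from the `df.head()` output and re-indexed from 0. The names `sample` and `math_only` are hypothetical:

```python
import pandas as pd

# Three rows from the df.head() output, re-indexed for this example.
sample = pd.DataFrame({
    "gender": ["female", "female", "male"],
    "math score": [72, 69, 47],
    "reading score": [72, 90, 57],
    "writing score": [74, 88, 44],
})

# A list of column names (double brackets) returns a DataFrame.
scores = sample[["math score", "reading score", "writing score"]].copy()

# A single column name returns a Series.
math_only = sample["math score"]

# Because of .copy(), changing scores does not touch sample.
scores.loc[0, "math score"] = 0
print(sample.loc[0, "math score"])
```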


## 3. Simple filtering

1) Filter students who scored more than 70 in math.  
2) From those students, show only their `math score` and `reading score`.

Show the first 10 rows.


In [8]:
top_students = df[(df['math score'] > 70)]
top_students[['math score', 'reading score']].head(10)

| | math score | reading score |
|---|---|---|
| 0 | 72 | 72 |
| 2 | 90 | 95 |
| 4 | 76 | 78 |
| 5 | 71 | 83 |
| 6 | 88 | 95 |
| 13 | 78 | 72 |
| 16 | 88 | 89 |
| 24 | 74 | 71 |
| 25 | 73 | 74 |
| 34 | 97 | 87 |
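
The filter and the column selection can also be combined in a single `.loc` call, and several conditions can be joined with `&` or `|` as long as each one is wrapped in parentheses. A minimal sketch on the first five rows from `df.head()` follows. The second condition (`reading score > 75`) is an invented example, not part of the task:

```python
import pandas as pd

# Math and reading scores of the first five rows from df.head().
sample = pd.DataFrame({
    "math score": [72, 69, 90, 47, 76],
    "reading score": [72, 90, 95, 57, 78],
})

# Row filter and column selection in one .loc call.
high_math = sample.loc[sample["math score"] > 70, ["math score", "reading score"]]

# Two conditions: parenthesize each one and join with & (and) or | (or).
high_both = sample[(sample["math score"] > 70) & (sample["reading score"] > 75)]

print(high_math)
print(high_both)
```

Note that the original row labels are kept after filtering, which is why the output above jumps from index 6 to 13.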


## 4. Creating new columns

Create:
- `average_score` = average of math, reading, and writing scores
- `passed` = True if `average_score >= 60`, else False

Show the first 5 rows.


In [9]:
df['average_score'] = df[['math score', 'reading score', 'writing score']].mean(axis=1)

# A pass is an average of at least 60 (inclusive)
df['passed'] = df['average_score'] >= 60

df.head()

| | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | average_score | passed |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 | 72.666667 | True |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 | 82.333333 | True |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 | 92.666667 | True |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 | 49.333333 | False |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 | 76.333333 | True |
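
The task defines `passed` as `average_score >= 60`, so a student who averages exactly 60 should count as passing. The sketch below uses hypothetical scores, chosen so that the middle student lands exactly on the boundary, to show how `>=` and `>` differ:

```python
import pandas as pd

# Hypothetical scores: the middle student averages exactly 60.
demo = pd.DataFrame({
    "math score": [47, 60, 72],
    "reading score": [57, 60, 72],
    "writing score": [44, 60, 74],
})

score_cols = ["math score", "reading score", "writing score"]
demo["average_score"] = demo[score_cols].mean(axis=1)   # row-wise mean

demo["passed"] = demo["average_score"] >= 60            # boundary counts as a pass
strictly_above = demo["average_score"] > 60             # boundary would fail here

print(demo[["average_score", "passed"]])
print(strictly_above)
```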


## 5. Value counts

- Count how many students passed vs failed.

Question (comment):
- Is this a count or a proportion?


In [10]:
df['passed'].value_counts()
# This is a count; a proportion would be df['passed'].value_counts() / len(df)

passed
True     707
False    293
Name: count, dtype: int64
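
Dividing by `len(df)` works, and `value_counts` can also return proportions directly through its `normalize` parameter. A minimal sketch on a hypothetical four-element Series (not the real `passed` column):

```python
import pandas as pd

# Hypothetical pass/fail flags, not the real column.
passed = pd.Series([True, True, True, False], name="passed")

counts = passed.value_counts()                      # absolute counts
proportions = passed.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(proportions)
```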

## 6. Groupby basics

Compute the **average math score** by:
- `gender`

Then compute the **average reading score** by:
- `parental level of education`

Show both results.


In [11]:
df.groupby('gender')['math score'].mean()

gender
female    63.633205
male      68.728216
Name: math score, dtype: float64

In [12]:
df.groupby('parental level of education')['reading score'].mean()


parental level of education
associate's degree    70.927928
bachelor's degree     73.000000
high school           64.704082
master's degree       75.372881
some college          69.460177
some high school      66.938547
Name: reading score, dtype: float64
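
A group mean is easier to interpret next to the group size, and sorting makes the ranking obvious. The sketch below is one possible extension, not part of the task: it applies `agg` with two function names and then `sort_values`, using the five rows from `df.head()`:

```python
import pandas as pd

# Gender and math score of the first five rows from df.head().
sample = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "math score": [72, 69, 90, 47, 76],
})

# Mean and number of rows per group, side by side.
by_gender = sample.groupby("gender")["math score"].agg(["mean", "count"])

# Highest average first.
ranked = by_gender.sort_values("mean", ascending=False)
print(ranked)
```

With only five rows, the ranking here differs from the full dataset, where the male average is higher. That is a good reason to check the `count` column before reading too much into a mean.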

## 7. Capstone: clean summary table

Create a DataFrame `summary` that contains, per gender, the:
- average math score
- average reading score
- average writing score

Show the first 5 rows.

In [13]:
summary = df.groupby('gender')[['math score', 'reading score', 'writing score']].mean()
summary.head()

| gender | math score | reading score | writing score |
|---|---|---|---|
| female | 63.633205 | 72.608108 | 72.467181 |
| male | 68.728216 | 65.473029 | 63.311203 |
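
For a summary table meant to be read or exported, it can help to round the means and turn the `gender` index back into an ordinary column. The sketch below shows one way to do this with `round` and `reset_index`, again on the five rows from `df.head()`. Rounding to two decimals is an arbitrary choice:

```python
import pandas as pd

# First five rows from df.head(), reduced to gender and the three scores.
sample = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "math score": [72, 69, 90, 47, 76],
    "reading score": [72, 90, 95, 57, 78],
    "writing score": [74, 88, 93, 44, 75],
})

score_cols = ["math score", "reading score", "writing score"]
summary = (
    sample.groupby("gender")[score_cols]
    .mean()
    .round(2)        # two decimals, an arbitrary choice for readability
    .reset_index()   # make gender a regular column instead of the index
)
print(summary)
```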
