This Jupyter notebook generates visualizations and extracts initial insights from the UCI Student Performance dataset (`student-por.csv`).

The detailed analysis and interpretation are discussed in the report.

In [1]:
!pip install ydata-profiling autoviz    # Install the required libraries for the analysis

Collecting ydata-profiling
  Downloading ydata_profiling-4.11.0-py2.py3-none-any.whl.metadata (20 kB)
Collecting autoviz
  Downloading autoviz-0.1.905-py3-none-any.whl.metadata (14 kB)


In [5]:
import pandas as pd

# Read the data from a csv file, with a ; separator
df = pd.read_csv("student-por.csv", sep=";")
df.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,4,0,11,11
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,2,9,11,11
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,6,12,13,12
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,0,14,14,14
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,0,11,13,13
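
The `sep=";"` argument matters because `read_csv` assumes commas by default. The sketch below uses a tiny in-memory stand-in for the file (the two rows are illustrative, not taken from `student-por.csv`) to show what the wrong separator would produce.

```python
import io

import pandas as pd

# Illustrative semicolon-separated text standing in for the real file.
raw = "school;sex;age\nGP;F;18\nMS;M;17\n"

# Default separator (","): every line is read as a single fused column.
wrong = pd.read_csv(io.StringIO(raw))

# Explicit separator: the three fields are split as intended.
right = pd.read_csv(io.StringIO(raw), sep=";")

print(wrong.shape, list(wrong.columns))
print(right.shape, list(right.columns))
```

A frame with a single column whose name contains the delimiter is usually the first sign that the separator is wrong.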


In [6]:
df.shape    # (rows, columns) of the data

(649, 33)

The table below describes every variable in the dataset.

The [UCI Student Performance page](https://archive.ics.uci.edu/dataset/320/student+performance) also explains the meaning of each variable.

| Variable Name | Short description|
| :-:|:-|
| 'school'| student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)|
| 'sex'| student's sex (binary: 'F' - female or 'M' - male)|
| 'age'| student's age (numeric: from 15 to 22)|
| 'address'| student's home address type (binary: 'U' - urban or 'R' - rural)|
| 'famsize'| family size (binary: 'LE3' - less than or equal to 3 or 'GT3' - greater than 3)|
| 'Pstatus'| parents' cohabitation status (binary: 'T' - living together or 'A' - apart)|
| 'Medu'| mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)|
| 'Fedu'| father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)|
| 'Mjob'| mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')|
| 'Fjob'| father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')|
| 'reason'| reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')|
| 'guardian'| student's guardian (nominal: 'mother', 'father' or 'other')|
| 'traveltime'| home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)|
| 'studytime'| weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)|
| 'failures'| number of past class failures (numeric: n if 1<=n<3, else 4, as documented by UCI; this file contains the values 0, 1, 2 and 3)|
| 'schoolsup'| extra educational support (binary: yes or no)|
| 'famsup'| family educational support (binary: yes or no)|
| 'paid'| extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)|
| 'activities'| extra-curricular activities (binary: yes or no)|
| 'nursery'| attended nursery school (binary: yes or no)|
| 'higher'| wants to pursue higher education (binary: yes or no)|
| 'internet'| Internet access at home (binary: yes or no)|
| 'romantic'| in a romantic relationship (binary: yes or no)|
| 'famrel'| quality of family relationships (numeric: from 1 - very bad to 5 - excellent)|
| 'freetime'| free time after school (numeric: from 1 - very low to 5 - very high)|
| 'goout'| going out with friends (numeric: from 1 - very low to 5 - very high)|
| 'Dalc'| workday alcohol consumption (numeric: from 1 - very low to 5 - very high)|
| 'Walc'| weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)|
| 'health'| current health status (numeric: from 1 - very bad to 5 - very good)|
| 'absences'| number of school absences (numeric: from 0 to 93, as documented by UCI; the maximum in this file is 32)|
| 'G1'| first period grade (numeric: from 0 to 20)|
| 'G2'| second period grade (numeric: from 0 to 20)|
| 'G3'| final grade (numeric: from 0 to 20, output target)|
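
Several variables in the table are stored as integer codes. A minimal sketch of turning those codes into readable labels for plots is shown here; the `decode` helper and the `sample` rows are hypothetical, while the code-to-label mappings are copied from the table.

```python
import pandas as pd

# Code-to-label mappings copied from the variable table.
EDUCATION = {
    0: "none",
    1: "primary education (4th grade)",
    2: "5th to 9th grade",
    3: "secondary education",
    4: "higher education",
}
TRAVELTIME = {1: "<15 min.", 2: "15 to 30 min.", 3: "30 min. to 1 hour", 4: ">1 hour"}
STUDYTIME = {1: "<2 hours", 2: "2 to 5 hours", 3: "5 to 10 hours", 4: ">10 hours"}


def decode(frame, mappings):
    """Return a copy of `frame` with coded columns replaced by their labels."""
    out = frame.copy()
    for column, labels in mappings.items():
        out[column] = out[column].map(labels)
    return out


# Illustrative rows only; in the notebook this would be `df`.
sample = pd.DataFrame(
    {"Medu": [4, 1], "Fedu": [4, 1], "traveltime": [2, 1], "studytime": [2, 2]}
)
decoded = decode(
    sample,
    {"Medu": EDUCATION, "Fedu": EDUCATION, "traveltime": TRAVELTIME, "studytime": STUDYTIME},
)
print(decoded)
```

Working on a copy keeps the numeric codes available for `describe()` and correlation plots.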


In [None]:
df[df.duplicated()]  # Show the duplicated rows; an empty result means there are none

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
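
The empty result above means no row repeats an earlier one. As a small sketch of how the same check behaves when duplicates do exist, the toy frame below (made-up values) contains one repeated row; `keep=False` is used to list every member of a duplicated group rather than only the later copies.

```python
import pandas as pd

# Made-up rows: the last row repeats the first one.
toy = pd.DataFrame(
    {"school": ["GP", "GP", "MS", "GP"], "age": [18, 17, 16, 18]}
)

# Number of rows that are exact copies of an earlier row.
n_duplicates = int(toy.duplicated().sum())

# All rows that belong to a duplicated group, including the first occurrence.
all_copies = toy[toy.duplicated(keep=False)]

print(n_duplicates)
print(all_copies)
```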


In [None]:
# Show the sorted unique values of each column
unique_values_series = pd.Series({col: sorted(df[col].unique().tolist()) for col in df.columns})
print(unique_values_series)

school                                                 [GP, MS]
sex                                                      [F, M]
age                            [15, 16, 17, 18, 19, 20, 21, 22]
address                                                  [R, U]
famsize                                              [GT3, LE3]
Pstatus                                                  [A, T]
Medu                                            [0, 1, 2, 3, 4]
Fedu                                            [0, 1, 2, 3, 4]
Mjob                [at_home, health, other, services, teacher]
Fjob                [at_home, health, other, services, teacher]
reason                        [course, home, other, reputation]
guardian                                [father, mother, other]
traveltime                                         [1, 2, 3, 4]
studytime                                          [1, 2, 3, 4]
failures                                           [0, 1, 2, 3]
schoolsup                               

In [None]:
df.isna().sum().sum()   # Total number of missing values across all columns

0
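
A single total hides where the gaps are when there are any. The sketch below, on a made-up frame with two missing entries, shows a per-column breakdown that could be used on datasets that are less clean than this one.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing number and one missing string.
toy = pd.DataFrame({"age": [18, np.nan, 16], "school": ["GP", "MS", None]})

missing_per_column = toy.isna().sum()          # count per column
missing_total = int(missing_per_column.sum())  # same figure as .isna().sum().sum()
columns_with_gaps = missing_per_column[missing_per_column > 0].index.tolist()

print(missing_per_column)
print(missing_total, columns_with_gaps)
```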

In [None]:
df.describe()   # Show the statistics for every numerical variable

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
count,649.0,649.0,649.0,649.0,649.0,649.0,649.0,649.0,649.0,649.0,649.0,649.0,649.0,649.0,649.0,649.0
mean,16.744222,2.514638,2.306626,1.568567,1.930663,0.22188,3.930663,3.180277,3.1849,1.502311,2.280431,3.53621,3.659476,11.399076,11.570108,11.906009
std,1.218138,1.134552,1.099931,0.74866,0.82951,0.593235,0.955717,1.051093,1.175766,0.924834,1.28438,1.446259,4.640759,2.745265,2.913639,3.230656
min,15.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0
25%,16.0,2.0,1.0,1.0,1.0,0.0,4.0,3.0,2.0,1.0,1.0,2.0,0.0,10.0,10.0,10.0
50%,17.0,2.0,2.0,1.0,2.0,0.0,4.0,3.0,3.0,1.0,2.0,4.0,2.0,11.0,11.0,12.0
75%,18.0,4.0,3.0,2.0,2.0,0.0,5.0,4.0,4.0,2.0,3.0,5.0,6.0,13.0,13.0,14.0
max,22.0,4.0,4.0,4.0,4.0,3.0,5.0,5.0,5.0,5.0,5.0,5.0,32.0,19.0,19.0,19.0
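
The `min` and `max` rows can be compared with the ranges documented in the variable table. A minimal sketch of such a check follows; the `out_of_range` helper and the toy values are hypothetical, and the ranges are the ones listed in the table.

```python
import pandas as pd

# Documented (low, high) ranges from the variable table, both ends inclusive.
DOCUMENTED_RANGES = {
    "age": (15, 22),
    "absences": (0, 93),
    "G1": (0, 20),
    "G2": (0, 20),
    "G3": (0, 20),
}


def out_of_range(frame, ranges):
    """Return {column: number of values outside the documented range}."""
    report = {}
    for column, (low, high) in ranges.items():
        if column in frame.columns:
            inside = frame[column].between(low, high)  # inclusive on both ends
            report[column] = int((~inside).sum())
    return report


# Made-up values: one age and one grade fall outside the documented range.
toy = pd.DataFrame({"age": [15, 22, 23], "absences": [0, 32, 4], "G3": [0, 19, 21]})
violations = out_of_range(toy, DOCUMENTED_RANGES)
print(violations)
```

In the notebook the call would be `out_of_range(df, DOCUMENTED_RANGES)`; columns missing from the frame are skipped.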


In [None]:
df.describe(include='object')   # Show statistics for each categorical variable

Unnamed: 0,school,sex,address,famsize,Pstatus,Mjob,Fjob,reason,guardian,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic
count,649,649,649,649,649,649,649,649,649,649,649,649,649,649,649,649,649
unique,2,2,2,2,2,5,5,4,3,2,2,2,2,2,2,2,2
top,GP,F,U,GT3,T,other,other,course,mother,no,yes,no,no,yes,yes,yes,no
freq,423,383,452,457,569,258,367,285,455,581,398,610,334,521,580,498,410
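
The `freq` row is easier to interpret as a share of the 649 rows, because it shows how unbalanced some categorical variables are. The sketch below recomputes those shares from figures copied from the output above; only a few columns are included for brevity.

```python
import pandas as pd

N_ROWS = 649  # number of rows reported by df.shape

# Frequency of the most common category, copied from the `freq` row.
top_freq = pd.Series(
    {"school": 423, "sex": 383, "Pstatus": 569, "schoolsup": 581, "paid": 610, "higher": 580}
)

# Share of rows taken by the most common category, rounded to three decimals.
top_share = (top_freq / N_ROWS).round(3)
print(top_share.sort_values(ascending=False))
```

On the full frame, `df[col].value_counts(normalize=True)` gives the same information for every category of a column.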


In [3]:
from ydata_profiling import ProfileReport
profile = ProfileReport(df, title="Students Performance Report")    # Create the ydata-profiling report
profile.to_file("report.html")                                      # Save the report
profile                                                           # Show the report in the notebook




In [4]:
from autoviz import AutoViz_Class
AV = AutoViz_Class()

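# Create and show the AutoViz report.
# AutoViz analyses at most 30 columns by default (max_cols_analyzed=30), so with
# 33 columns the last three (`G1`, `G2` and `G3`) are left out.
dft = AV.AutoViz(
    "",                     # no filename: the data comes from `dfte`
    sep=",",                # only relevant when reading from a file
    depVar="",              # no target variable selected
    dfte=df,
    header=0,
    verbose=1,
    lowess=False,
    chart_format="bokeh",   # plot type for notebooks
    save_plot_dir=None
)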

Imported v0.1.905. Please call AutoViz in this sequence:
    AV = AutoViz_Class()
    %matplotlib inline
    dfte = AV.AutoViz(filename, sep=',', depVar='', dfte=None, header=0, verbose=1, lowess=False,
               chart_format='svg',max_rows_analyzed=150000,max_cols_analyzed=30, save_plot_dir=None)
Shape of your Data Set loaded: (649, 33)
#######################################################################################
######################## C L A S S I F Y I N G  V A R I A B L E S  ####################
#######################################################################################
Classifying variables in data set...
    Number of Numeric Columns =  0
    Number of Integer-Categorical Columns =  16
    Number of String-Categorical Columns =  4
    Number of Factor-Categorical Columns =  0
    Number of String-Boolean Columns =  13
    Number of Numeric-Boolean Columns =  0
    Number of Discrete String Columns =  0
    Number of NLP String Columns =  0
    Number o

No date vars could be found in data set


Time to run AutoViz (in seconds) = 8
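
Because the grade columns sit at the end of the frame, they fall outside the 30-column limit noted in the AutoViz cell. One possible workaround, sketched here with plain pandas, is to move them to the front before calling AutoViz. This assumes AutoViz keeps the leading columns, as that cell's comment states, which is not verified here; the `put_first` helper is hypothetical. The banner also lists a `max_cols_analyzed` argument, so passing a larger value may be an alternative.

```python
import pandas as pd


def put_first(frame, columns):
    """Return `frame` with `columns` moved to the front, other columns in their original order."""
    rest = [c for c in frame.columns if c not in columns]
    return frame[list(columns) + rest]


# Made-up one-row frame; in the notebook this would be `df`.
toy = pd.DataFrame({"school": ["GP"], "age": [18], "G1": [0], "G2": [11], "G3": [11]})
reordered = put_first(toy, ["G1", "G2", "G3"])
print(list(reordered.columns))
```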
