# Exploratory Data Analysis

In [29]:
import pandas as pd
pd.set_option("display.max_rows", 15, "display.max_columns", None)

At the start of the exploratory analysis, it helps to take a look at the structure and contents of the raw dataset we obtained from the [UCI ML Repo](https://archive-beta.ics.uci.edu/ml/datasets/student+performance).

In [41]:
pd.read_csv('../data/raw/student-mat.csv', sep = ";").iloc[:,1:]

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course,mother,2,2,0,yes,no,no,no,yes,yes,no,no,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,course,father,1,2,0,no,yes,no,no,no,yes,yes,no,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,other,mother,1,2,3,yes,no,yes,no,yes,yes,yes,no,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,home,mother,1,3,0,no,yes,yes,yes,yes,yes,yes,yes,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,home,father,1,2,0,no,yes,yes,no,yes,yes,no,no,4,3,2,1,2,5,4,6,10,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
390,MS,M,20,U,LE3,A,2,2,services,services,course,other,1,2,2,no,yes,yes,no,yes,yes,no,no,5,5,4,4,5,4,11,9,9,9
391,MS,M,17,U,LE3,T,3,1,services,services,course,mother,2,1,0,no,no,no,no,no,yes,yes,no,2,4,5,3,4,2,3,14,16,16
392,MS,M,21,R,GT3,T,1,1,other,other,course,other,1,1,3,no,no,no,no,no,yes,no,no,5,5,3,3,3,3,3,10,8,7
393,MS,M,18,R,LE3,T,3,2,services,other,course,mother,3,1,0,no,no,no,no,no,yes,yes,no,4,4,1,3,4,5,0,11,12,10


Next, we settled on a train-test split of 80% training and 20% testing. Because the number of samples in the dataset is moderately low (396), we wanted to keep as many samples as possible for training. Using pandas' `describe(include="all")` together with the Styler to build a table, we can find useful information about the features in the training set. For the numeric features, for example, the average mother's education (`Medu`, mean 2.80) is higher than the average father's education (`Fedu`, mean 2.55). For the categorical features, the most frequent parent status in the training set is living together (`Pstatus = T`, 290 of 316 students), and most students are not in a romantic relationship (209 of 316).

In [36]:
pd.read_csv('../results/exploratory-stu-mat.csv', sep = ",")

Unnamed: 0.1,Unnamed: 0,studytime,Pstatus,Medu,Fedu,Mjob,Fjob,goout,romantic,traveltime
0,count,316.0,316,316.0,316.0,316,316,316.0,316,316.0
1,unique,,2,,,5,5,,2,
2,top,,T,,,other,other,,no,
3,freq,,290,,,115,174,,209,
4,mean,2.047468,,2.797468,2.547468,,,3.120253,,1.436709
5,std,0.843816,,1.067616,1.090053,,,1.091715,,0.680182
6,min,1.0,,1.0,0.0,,,1.0,,1.0
7,25%,1.0,,2.0,2.0,,,2.0,,1.0
8,50%,2.0,,3.0,2.5,,,3.0,,1.0
9,75%,2.0,,4.0,4.0,,,4.0,,2.0
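
As a rough illustration of how a split and summary table like the one above can be produced, here is a minimal sketch. It is not the project's actual script: `toy_df` is a small made-up stand-in for the student data, the split uses pandas' `sample` rather than any particular splitting helper, and the seed `123` is an arbitrary choice.

```python
import pandas as pd

# Toy stand-in for the student data (values are made up for illustration).
toy_df = pd.DataFrame(
    {
        "studytime": [2, 2, 3, 1, 2, 4, 1, 2, 3, 2],
        "Medu": [4, 1, 1, 4, 3, 2, 3, 4, 2, 3],
        "Pstatus": ["A", "T", "T", "T", "T", "T", "A", "T", "T", "T"],
        "romantic": ["no", "no", "no", "yes", "no", "yes", "no", "no", "yes", "no"],
    }
)

# 80/20 split: sample 80% of the rows for training, keep the rest for testing.
# random_state is an arbitrary seed, used only to make the split repeatable.
train_df = toy_df.sample(frac=0.8, random_state=123)
test_df = toy_df.drop(train_df.index)

# include="all" summarises numeric and categorical columns in one table:
# count/unique/top/freq for categoricals, mean/std/quantiles for numerics.
summary = train_df.describe(include="all")
print(summary)
```

Cells that do not apply to a column type (for example, `mean` for `Pstatus`) come back as `NaN`, which is why the table above has so many empty entries.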


From the numeric features shown in {numref}`Figure {number}: {name} <num-fig>`, we can see that, for the most part, there is no clear relationship between these features and the predicted grade. The one notable pattern is that the range of grades narrows as travel time increases: the lowest grades are higher, and the highest grades are lower, than for shorter travel times. It is difficult to say whether this pattern is real, however, given the low number of samples with high travel time.

:::{figure-md} num-fig
<img src="../results/figures/explore_numeric.png" alt="num" class="bg-primary mb-1" width="550px">

A series of plots examining the numeric features compared to predicted grade
:::
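
The travel-time observation can also be checked numerically by tabulating the sample count and grade range per travel-time level. The sketch below uses made-up values, and it assumes that `G3` (the final grade column in the raw data) is the grade of interest. That assumption is for illustration only.

```python
import pandas as pd

# Made-up values for illustration; traveltime is an ordinal 1-4 scale.
toy_df = pd.DataFrame(
    {
        "traveltime": [1, 1, 1, 1, 1, 2, 2, 2, 3, 4],
        "G3": [6, 15, 10, 0, 19, 8, 14, 11, 10, 9],
    }
)

# For each travel-time level: how many students, and how wide the grade range is.
spread = toy_df.groupby("traveltime")["G3"].agg(["count", "min", "max"])
spread["range"] = spread["max"] - spread["min"]
print(spread)
```

Reading `count` next to `range` makes the caveat concrete: a narrow range at a level with only a handful of students may simply reflect the small sample.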

From the exploratory analysis of the categorical variables shown in {numref}`Figure {number}: {name} <cat-fig>`, we can see that some of these variables have a large imbalance between classes. This is especially prominent in parent status (`Pstatus`) and father's job (`Fjob`). One consequence is that the corresponding coefficients may not be very useful for predicting grades because, for example, `Pstatus = T` is heavily represented among both high and low grades. Furthermore, given how few students have `Pstatus = A`, the model could misrepresent the data when applied to the test set if all of the "A" values happen to fall among either the high or the low grades rather than a mix of both.

:::{figure-md} cat-fig
<img src="../results/figures/explore_cat.png" alt="cat" class="bg-primary mb-1" width="550px">

A series of plots examining the categorical features compared to predicted grade
:::
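
The class imbalance discussed above can be quantified with a frequency table and a cross-tabulation. This is a minimal sketch on made-up data. The `grade_band` column, with its "high" and "low" labels, is a hypothetical binning introduced here for illustration and is not a column in the dataset.

```python
import pandas as pd

# Made-up data: 8 students with Pstatus "T", 2 with "A".
toy_df = pd.DataFrame(
    {
        "Pstatus": ["T"] * 8 + ["A"] * 2,
        "grade_band": ["high", "low", "high", "low", "high",
                       "low", "high", "low", "high", "high"],
    }
)

# Share of each class: a heavily skewed share signals imbalance.
class_share = toy_df["Pstatus"].value_counts(normalize=True)
print(class_share)

# Counts of each grade band within each class; empty cells are filled with 0.
band_by_status = pd.crosstab(toy_df["Pstatus"], toy_df["grade_band"])
print(band_by_status)
```

In this toy example the majority class is split evenly across both bands, while the minority class appears in only one band, which mirrors the two concerns raised above.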