# Pandas Data Analysis Example


# Introduction

Pandas makes data analysis accessible. We can use this Python library to perform exploratory data analysis and find relationships in our data. In this guided project we will explore a dataset describing the math performance of a sample of students from two schools in Portugal. The dataset comes from the UCI Machine Learning Repository.

# Loading the Data

In this example we will be using Jupyter notebooks to analyze the data.

First we download the dataset [student-mat.csv](https://s3-eu-west-1.amazonaws.com/ih-materials/uploads/data-static/data/module-2/student-mat.csv) into a data folder, then load it with Pandas:

In [1]:
import numpy as np
import pandas as pd

student = pd.read_csv('data/student-mat.csv')
student.head()

,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


Let's see how many columns this dataset has and what they contain:

In [2]:
student.dtypes

school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery       object
higher        object
internet      object
romantic      object
famrel         int64
freetime       int64
goout          int64
Dalc           int64
Walc           int64
health         int64
absences       int64
G1             int64
G2             int64
G3             int64
dtype: object

We can see that there are 17 object columns (columns that contain strings) and 16 integer columns, for 33 columns in total. The data dictionary explains what each column and each coded value means, so we should read it whenever we start on a new dataset.
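
The column counts above can also be computed rather than read off the listing. The sketch below uses a tiny made-up frame (`sample`, not the real dataset) so that it runs on its own; with the real data, `student` would take its place.

```python
import pandas as pd

# Tiny made-up frame standing in for `student`; the values are illustrative only.
sample = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "sex": ["F", "M", "F"],
    "age": [18, 17, 15],
    "G3": [6, 10, 15],
})

# shape gives (number of rows, number of columns).
n_rows, n_cols = sample.shape

# Split the column names into numeric and non-numeric (string) columns.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols)
print("numeric:", numeric_cols)
print("text:", text_cols)
```

Using `select_dtypes` avoids counting lines of the `dtypes` output by hand.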

# Describing the Data

Let's look at the numeric columns in the dataset using the describe function:

In [3]:
student.describe()

,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
count,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0
mean,16.696203,2.749367,2.521519,1.448101,2.035443,0.334177,3.944304,3.235443,3.108861,1.481013,2.291139,3.55443,5.708861,10.908861,10.713924,10.41519
std,1.276043,1.094735,1.088201,0.697505,0.83924,0.743651,0.896659,0.998862,1.113278,0.890741,1.287897,1.390303,8.003096,3.319195,3.761505,4.581443
min,15.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,3.0,0.0,0.0
25%,16.0,2.0,2.0,1.0,1.0,0.0,4.0,3.0,2.0,1.0,1.0,3.0,0.0,8.0,9.0,8.0
50%,17.0,3.0,2.0,1.0,2.0,0.0,4.0,3.0,3.0,1.0,2.0,4.0,4.0,11.0,11.0,11.0
75%,18.0,4.0,3.0,2.0,2.0,0.0,5.0,4.0,4.0,2.0,3.0,5.0,8.0,13.0,13.0,14.0
max,22.0,4.0,4.0,4.0,4.0,3.0,5.0,5.0,5.0,5.0,5.0,5.0,75.0,19.0,19.0,20.0


Some observations regarding the data:

- The median age in this group is 17 (with a mean of 16.696).
- The median travel time is under 15 minutes.
- One student has 75 absences, and the mean (5.71) is larger than the median (4), so the distribution of absences is right-skewed and this student is likely an outlier.
- The mean and median grades stayed similar throughout the year: the mean stayed between 10 and 11 across G1, G2 and G3, and the median stayed at 11.
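
One way to back up the skew and outlier observation with numbers is sketched below. The absence counts are made up for illustration (including one extreme value), and the 1.5 × IQR rule is a common convention chosen here, not something the original analysis used.

```python
import pandas as pd

# Made-up absence counts with one extreme value, standing in for student.absences.
absences = pd.Series([0, 0, 2, 4, 4, 6, 8, 10, 75], name="absences")

mean, median = absences.mean(), absences.median()
skewness = absences.skew()  # positive values indicate a long right tail

# Tukey's rule: flag values more than 1.5 * IQR above the third quartile.
q1, q3 = absences.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = absences[absences > upper_fence]

print(mean, median, skewness)
print(outliers.tolist())
```

A mean well above the median together with positive skewness points to a right-skewed column, and the fence gives an explicit cut-off for what counts as an outlier.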

We can also draw some inferences from the variables that contain strings.

We can do this with the crosstab function, which counts how often each value of a variable occurs. It can also cross-tabulate the counts of two or more variables at once.

For example, we would like to know how many males and how many females are in this survey:

In [4]:
pd.crosstab(index=student.sex, columns="count")

col_0,count
sex,
F,208
M,187


So here we see that there are slightly more females (208) than males (187) in the survey.
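
For a single variable, `value_counts` is a possible alternative to `crosstab`, and it can return shares as well as counts. The sketch below runs on a short made-up series rather than on `student.sex`.

```python
import pandas as pd

# Made-up values standing in for student.sex.
sex = pd.Series(["F", "F", "M", "F", "M"], name="sex")

counts = sex.value_counts()                # absolute count per category
shares = sex.value_counts(normalize=True)  # fraction of rows per category

print(counts)
print(shares)
```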

We can also look at the breakdown of students who participate in extracurricular activities by sex.

In [5]:
pd.crosstab(index=student.sex, columns=student.activities)

activities,no,yes
sex,,
F,112,96
M,82,105


The proportion of males participating in extracurricular activities (105 of 187, about 56%) is larger than that of females (96 of 208, about 46%).
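
Proportions like these do not have to be worked out by hand: `crosstab` accepts a `normalize` argument. A minimal sketch on made-up data, where `sample` is a hypothetical stand-in for `student`:

```python
import pandas as pd

# Made-up rows standing in for the sex and activities columns of `student`.
sample = pd.DataFrame({
    "sex": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "activities": ["no", "no", "no", "yes", "no", "yes", "yes", "yes"],
})

# normalize="index" divides each row by its row total,
# so every row of the result sums to 1.
shares = pd.crosstab(index=sample.sex, columns=sample.activities, normalize="index")

print(shares)
```

With `normalize="index"` each row shows the split within one sex, which is what a comparison of participation rates needs.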

We can also look at type of address vs. family size:

In [6]:
pd.crosstab(index=student.address, columns=student.famsize)

famsize,GT3,LE3
address,,
R,68,20
U,213,94


There are far fewer students living in rural areas, and a larger proportion of them live in families of more than three (68 of 88, about 77%, compared with 213 of 307, about 69%, of urban students).


# Calculated Columns

We can extend our analysis by computing new columns from the existing data. For example, we would like to know how many students improved their grade between the first and second period. We create a calculated column and then count the number of students who improved and did not improve at each school.

In [7]:
student['improvement'] = np.where(student.G2 > student.G1, "improved", "did not improve")
pd.crosstab(index=student.school, columns=student.improvement)

improvement,did not improve,improved
school,,
GP,228,121
MS,39,7


The data shows that just over a third of students at Gabriel Pereira improved (121 of 349), compared with about 15% at Mousinho da Silveira (7 of 46).
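
The share of students who improved at each school can also be computed directly. The sketch below uses a boolean column and `groupby` on made-up grades; this is a swapped-in approach for illustration, not the crosstab used above.

```python
import pandas as pd

# Made-up grades standing in for the school, G1 and G2 columns of `student`.
sample = pd.DataFrame({
    "school": ["GP", "GP", "GP", "MS", "MS"],
    "G1": [5, 7, 15, 6, 10],
    "G2": [6, 8, 14, 10, 10],
})

# True where the second-period grade is strictly higher than the first.
improved = sample.G2 > sample.G1

# The mean of a boolean series is the fraction of True values.
share_improved = improved.groupby(sample.school).mean()

print(share_improved)
```

As in the calculated column above, a student whose grade stayed the same counts as not improved.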


# Pivot Tables

A pivot table gives us a concise summary of this data across several variables at once.

Let's generate a pivot table that shows us the mean final grade by school, sex and weekly study time:



In [8]:
student.pivot_table(index=["school"], columns=["sex", "studytime"], values=["G3"], fill_value=0)

,G3,G3,G3,G3,G3,G3,G3,G3
sex,F,F,F,F,M,M,M,M
studytime,1,2,3,4,1,2,3,4
school,,,,,,,,
GP,10.652174,9.363636,10.590909,11,10.363636,11.090909,13.923077,11.7
MS,5.25,10.428571,11.571429,0,8.75,10.875,13.0,0.0


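We can see that at Mousinho da Silveira there are no students studying the maximum amount of time: the zeros in the studytime 4 columns come from fill_value=0 standing in for an empty group, not from actual mean grades.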
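
The effect of `fill_value` is easiest to see by building the same pivot table with and without it. The sketch below uses a small made-up frame in which one school has no students in studytime group 4.

```python
import pandas as pd

# Made-up rows; school "MS" has no students in studytime group 4.
sample = pd.DataFrame({
    "school": ["GP", "GP", "GP", "MS", "MS"],
    "studytime": [1, 4, 4, 1, 1],
    "G3": [10, 12, 14, 8, 6],
})

# Without fill_value, an empty group shows up as NaN.
means = sample.pivot_table(index="school", columns="studytime", values="G3")

# With fill_value=0, the same empty group is displayed as 0.
filled = sample.pivot_table(index="school", columns="studytime", values="G3", fill_value=0)

print(means)
print(filled)
```

Leaving the gaps as NaN makes it harder to mistake an empty group for a mean grade of zero.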

Also, the fact that male students at Gabriel Pereira who study 5-10 hours a week (studytime 3) have a higher mean grade than those who study more than 10 hours per week (studytime 4) could be due to the small sample size of the latter group.

We can examine this by looking at a pivot table of counts instead of means.

In [9]:
student.pivot_table(index=["school"], columns=["sex", "studytime"], values=["G3"], fill_value=0, aggfunc='count') 

,G3,G3,G3,G3,G3,G3,G3,G3
sex,F,F,F,F,M,M,M,M
studytime,1,2,3,4,1,2,3,4
school,,,,,,,,
GP,23,99,44,17,66,77,13,10
MS,4,14,7,0,12,8,1,0


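As we can see, studytime group 4 has the fewest students of each sex at both schools (17 females and 10 males at Gabriel Pereira, none at Mousinho da Silveira), so the means for this group rest on few observations.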
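
Means and counts can also be requested in a single call by passing a list to `aggfunc`, which keeps each mean next to the number of students behind it. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up rows standing in for the school, sex and G3 columns of `student`.
sample = pd.DataFrame({
    "school": ["GP", "GP", "GP", "MS", "MS"],
    "sex": ["F", "F", "M", "F", "M"],
    "G3": [10, 12, 14, 8, 6],
})

# A list of aggregation functions adds an outer column level named after each function.
summary = sample.pivot_table(
    index="school", columns="sex", values="G3", aggfunc=["mean", "count"]
)

print(summary)
```

Selecting `summary["mean"]` or `summary["count"]` returns the corresponding sub-table.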