# COGS 108 - Data Checkpoint

# Names

- Allen Zou
- Brian Han
- Dillen Padhiar
- Lian Su


# Research Question

Which factors correlate with frequent alcohol use among high school students in the United States? We focus on personal factors (sex and age), parental and family factors (parents' education, family size, income, etc.), and school performance factors (workload/free time, educational support, grades, etc.).


# Dataset(s)

- Dataset Name: Student Alcohol Consumption
- Link to the dataset: https://www.kaggle.com/uciml/student-alcohol-consumption
- Number of observations: 395 observations in the math class dataset, 649 observations in the Portuguese class dataset, and 382 overlapping observations

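The dataset consists of two CSV files, one for students in math classes and one for students in Portuguese language classes, drawn from two schools (Gabriel Pereira and Mousinho da Silveira). Each file has 33 columns covering personal, family, and school-related attributes, including workday (Dalc) and weekend (Walc) alcohol consumption and three grades (G1, G2, G3).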

# Setup

In [53]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
# import patsy
# import statsmodels.api as sm
# import os
# import folium
# from folium import plugins
# from folium.plugins import HeatMap

# Data Cleaning

The dataset link provides two separate CSV files: one contains students from math classes and the other contains students from Portuguese language classes in high schools. Note that 382 students appear in both files. We load the two files into DataFrames named student_math and student_portuguese.

In [54]:
# Read in the dataset
student_math = pd.read_csv('student-mat.csv')
student_portuguese = pd.read_csv('student-por.csv')

print('The math class dataset contains {} students and {} columns.'.format(len(student_math.index), len(student_math.columns)))
print('The Portuguese class dataset contains {} students and {} columns.'.format(len(student_portuguese.index), len(student_portuguese.columns)))

The math class dataset contains 395 students and 33 columns.
The Portuguese class dataset contains 649 students and 33 columns.


Neither dataset is large. However, 382 of the 395 students in the math class dataset also appear in the Portuguese class dataset, so dropping the student_math DataFrame loses only 13 students. We therefore expect dropping it to have little effect on our final analysis.
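
One way to check an overlap count like this is to join the two DataFrames on attributes that should be identical for the same student in both files. The following is a minimal sketch, not the method used to obtain the 382 figure: `count_overlap` is a hypothetical helper, the choice of identifying columns is an assumption, and the toy rows are made up rather than taken from the dataset.

```python
import pandas as pd

def count_overlap(df_a, df_b, id_cols):
    """Count distinct combinations of id_cols that appear in both DataFrames."""
    # An inner merge keeps only the key combinations present in both frames.
    merged = pd.merge(
        df_a[id_cols].drop_duplicates(),
        df_b[id_cols].drop_duplicates(),
        on=id_cols,
        how='inner',
    )
    return len(merged)

# Toy data for illustration only (not real rows from the dataset)
toy_math = pd.DataFrame({'school': ['GP', 'GP', 'MS'],
                         'sex': ['F', 'M', 'F'],
                         'age': [15, 16, 17]})
toy_por = pd.DataFrame({'school': ['GP', 'MS', 'MS'],
                        'sex': ['F', 'F', 'M'],
                        'age': [15, 17, 18]})

n_overlap = count_overlap(toy_math, toy_por, ['school', 'sex', 'age'])
n_math_only = len(toy_math) - n_overlap
print('Overlapping: {}, math only: {}'.format(n_overlap, n_math_only))
```

With the real files, `id_cols` would need to be a set of columns that uniquely identifies a student, since the data contain no student ID.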

In [55]:
# Since we're now ignoring the math class dataset, let's give the Portuguese dataset a shorter name
student = student_portuguese

# See what columns are in the dataset
print("Columns in the Portuguese dataset: ", end = '')
print(*student_portuguese, sep = ', ')

Columns in the Portuguese dataset: school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, guardian, traveltime, studytime, failures, schoolsup, famsup, paid, activities, nursery, higher, internet, romantic, famrel, freetime, goout, Dalc, Walc, health, absences, G1, G2, G3


This dataset contains a lot of information about each student, but to keep our project simple we will keep only the columns that are relevant to our research question.

In [56]:
# Keep only the relevant columns; .copy() makes student independent of student_portuguese
student = student.loc[:, ['sex', 'age', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'guardian', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2', 'G3']].copy()

We chose to drop the following columns:
- school (Gabriel Pereira or Mousinho da Silveira): we don't expect the difference between the two schools to affect our final analysis much, and it isn't needed for our research question.
- address (urban or rural): not relevant to our research question.
- reason (why the student chose the school: close to home, reputation, course preference, etc.): same reasoning as school.
- traveltime (how long it takes to get to school): not closely related to our research question.
- nursery (whether the student attended nursery school): same reasoning as traveltime.
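
The same selection can also be written by naming only the columns being removed, which keeps the code in step with the list above. This is a minimal sketch on a made-up one-row DataFrame; `dropped_cols`, `toy`, and `toy_kept` are illustrative names, not part of our notebook.

```python
import pandas as pd

# The five columns listed above
dropped_cols = ['school', 'address', 'reason', 'traveltime', 'nursery']

# Toy row for illustration only (values are made up)
toy = pd.DataFrame({
    'school': ['GP'], 'sex': ['F'], 'age': [18], 'address': ['U'],
    'reason': ['course'], 'traveltime': [2], 'nursery': ['yes'], 'Dalc': [1],
})

# drop() returns a new DataFrame and leaves the original untouched
toy_kept = toy.drop(columns=dropped_cols)
print(list(toy_kept.columns))
```

Listing the dropped columns is shorter here because only 5 of the 33 columns are removed.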

The dataset already appears to be relatively clean, but we perform a few checks below to confirm this.

In [57]:
# Are there any NaN entries?
if student.isnull().values.any():
    print('There are NaN values in this dataset.')
else:
    print('There are no NaN values in this dataset.')
student.head()

There are no NaN values in this dataset.


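|  | sex | age | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | guardian | studytime | failures | schoolsup | famsup | paid | activities | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 18 | GT3 | A | 4 | 4 | at_home | teacher | mother | 2 | 0 | yes | no | no | no | yes | no | no | 4 | 3 | 4 | 1 | 1 | 3 | 4 | 0 | 11 | 11 |
| 1 | F | 17 | GT3 | T | 1 | 1 | at_home | other | father | 2 | 0 | no | yes | no | no | yes | yes | no | 5 | 3 | 3 | 1 | 1 | 3 | 2 | 9 | 11 | 11 |
| 2 | F | 15 | LE3 | T | 1 | 1 | at_home | other | mother | 2 | 0 | yes | no | no | no | yes | yes | no | 4 | 3 | 2 | 2 | 3 | 3 | 6 | 12 | 13 | 12 |
| 3 | F | 15 | GT3 | T | 4 | 2 | health | services | mother | 3 | 0 | no | yes | no | yes | yes | yes | yes | 3 | 2 | 2 | 1 | 1 | 5 | 0 | 14 | 14 | 14 |
| 4 | F | 16 | GT3 | T | 3 | 3 | other | other | father | 2 | 0 | no | yes | no | no | yes | no | no | 4 | 3 | 2 | 1 | 2 | 5 | 0 | 11 | 13 | 13 |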
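
Beyond missing values, two other quick checks are duplicated rows and out-of-range values. The following is a minimal sketch under stated assumptions: the ranges in `expected_ranges` (1 to 5 for `Dalc` and `Walc`, 0 to 20 for the grades) reflect our reading of the dataset description and should be verified, `out_of_range_counts` is a hypothetical helper, and the toy rows are made up.

```python
import pandas as pd

# Assumed valid ranges (inclusive); verify against the dataset description
expected_ranges = {'Dalc': (1, 5), 'Walc': (1, 5),
                   'G1': (0, 20), 'G2': (0, 20), 'G3': (0, 20)}

def out_of_range_counts(df, ranges):
    """Return {column: number of values outside [low, high]} for each column."""
    return {col: int((~df[col].between(low, high)).sum())
            for col, (low, high) in ranges.items()}

# Toy data for illustration only, with two deliberately invalid values
toy = pd.DataFrame({'Dalc': [1, 6], 'Walc': [2, 3],
                    'G1': [0, 11], 'G2': [11, 21], 'G3': [11, 12]})

bad = out_of_range_counts(toy, expected_ranges)
n_dupes = int(toy.duplicated().sum())  # fully identical rows
print(bad)
print('Duplicated rows:', n_dupes)
```

On the real data, calling `out_of_range_counts(student, expected_ranges)` should return zero for every column if the values are valid.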


In [64]:
# Encode the guardian column numerically: 'mother' -> 1, 'father' -> 2
# (replace() leaves any value that is not in the mapping unchanged)
student['guardian'] = student['guardian'].replace({'mother': 1, 'father': 2})
student

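|  | sex | age | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | guardian | studytime | failures | schoolsup | famsup | paid | activities | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 18 | GT3 | A | 4 | 4 | at_home | teacher | 1 | 2 | 0 | yes | no | no | no | yes | no | no | 4 | 3 | 4 | 1 | 1 | 3 | 4 | 0 | 11 | 11 |
| 1 | F | 17 | GT3 | T | 1 | 1 | at_home | other | 2 | 2 | 0 | no | yes | no | no | yes | yes | no | 5 | 3 | 3 | 1 | 1 | 3 | 2 | 9 | 11 | 11 |
| 2 | F | 15 | LE3 | T | 1 | 1 | at_home | other | 1 | 2 | 0 | yes | no | no | no | yes | yes | no | 4 | 3 | 2 | 2 | 3 | 3 | 6 | 12 | 13 | 12 |
| 3 | F | 15 | GT3 | T | 4 | 2 | health | services | 1 | 3 | 0 | no | yes | no | yes | yes | yes | yes | 3 | 2 | 2 | 1 | 1 | 5 | 0 | 14 | 14 | 14 |
| 4 | F | 16 | GT3 | T | 3 | 3 | other | other | 2 | 2 | 0 | no | yes | no | no | yes | no | no | 4 | 3 | 2 | 1 | 2 | 5 | 0 | 11 | 13 | 13 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 644 | F | 19 | GT3 | T | 2 | 3 | services | other | 1 | 3 | 1 | no | no | no | yes | yes | yes | no | 5 | 4 | 2 | 1 | 2 | 5 | 4 | 10 | 11 | 10 |
| 645 | F | 18 | LE3 | T | 3 | 1 | teacher | services | 1 | 2 | 0 | no | yes | no | no | yes | yes | no | 4 | 3 | 4 | 1 | 1 | 1 | 4 | 15 | 15 | 16 |
| 646 | F | 18 | GT3 | T | 1 | 1 | other | other | 1 | 2 | 0 | no | no | no | yes | yes | no | no | 1 | 1 | 1 | 1 | 1 | 5 | 6 | 11 | 12 | 9 |
| 647 | M | 17 | LE3 | T | 3 | 1 | services | services | 1 | 1 | 0 | no | no | no | no | yes | yes | no | 2 | 4 | 5 | 3 | 4 | 2 | 6 | 10 | 10 | 10 |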
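
Coding the guardian as 1 or 2 gives the column a numeric ordering that the categories do not really have, and `replace` leaves any label outside the mapping as a string. One alternative worth considering is one-hot encoding with `pd.get_dummies`. This is a minimal sketch on made-up rows, not a change to our pipeline; the `other` label in the toy data is an assumption included to show how an extra category is handled.

```python
import pandas as pd

# Toy data for illustration only; 'other' is an assumed third category
toy = pd.DataFrame({'guardian': ['mother', 'father', 'other', 'mother']})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
guardian_dummies = pd.get_dummies(toy['guardian'], prefix='guardian', dtype=int)
print(guardian_dummies)
```

Each category becomes its own 0/1 column, so no category is treated as larger or smaller than another in later analysis.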


# Project Proposal (updated)

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/31  |  2 PM | Search for datasets that relate to our research question  | Go over the datasets and decide which would be best, what analysis we would like to do on them | 
| 2/7  |  2 PM |  Data cleaning assigned to each member | Assign roles for the project report (mainly coding, writing, analysis etc) | 
| 2/14  | 2 PM  | Brainstorm ideas for EDA, specifically what parts of the data we want to compare and how we will visualize them  | Review progress on EDA, discuss which analysis methods will work best |
| 2/21  | 2 PM  | Continue brainstorming ideas for EDA | Review progress on project for the check-in  |
| 2/28  | 2 PM  | Start writing code based on our ideas from the previous two meetings | Share our code with each other and start figuring out what to put on GitHub |
| 3/7  | 2 PM  | Have code and analysis finished, begin writing conclusion and drafting video ideas | Discuss the video portion of the project in terms of editing and content, review  |
