In [2]:
NAME = "Matthew Martin"
COLLABORATORS = ""

#### Project Topic
For my project I will be using the "Higher Education Students Performance Evaluation Dataset Data Set" from https://archive.ics.uci.edu/ml/datasets/Higher+Education+Students+Performance+Evaluation+Dataset. This will be a classification project to determine whether there are attributes, or combinations of attributes, that might predict a student's end-of-term performance. The motivation for this project is to 1) put into practice some of the techniques and skills we have learned over the semester and 2) determine whether any attributes (of the student, their family, or their study habits) correlate strongly with the student doing well or not doing well. An example of the second objective would be to see whether having additional work correlates with not doing well, or whether always attending class correlates with doing well. This will be a particularly interesting topic to explore and to reflect on with regard to how well it fits with my experience in this program so far.

#### Data

**Citation:** 
Nevriye Yilmaz, (nevriye.yilmaz '@' neu.edu.tr) and Boran Sekeroglu (boran.sekeroglu '@' neu.edu.tr). *Higher Education Students Performance Evaluation Dataset Data Set.* https://archive.ics.uci.edu/ml/datasets/Higher+Education+Students+Performance+Evaluation+Dataset#

This data is from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.php) and comes in the form of a .CSV file. This data set “was collected from the Faculty of Engineering and Faculty of Educational Sciences students in 2019” (Yilmaz and Sekeroglu). For the purposes of my analysis I will read the .CSV file data directly from the website above rather than copying it to a local location. 

The data consists of 145 instances and 33 attributes, and missing values are marked as “N/A.” The .CSV file contains students numbered 1-145 (presumably for anonymity) and simply lists the question identifiers from 1-30, with two extra columns containing the “COURSE ID” and “GRADE.” Every response is stored as an integer (except for missing values, which are marked as described above). The questions corresponding to the question identifiers are as follows (from Yilmaz and Sekeroglu):

1. Student Age (1: 18-21, 2: 22-25, 3: above 26)
2. Sex (1: female, 2: male)
3. Graduated high-school type: (1: private, 2: state, 3: other)
4. Scholarship type: (1: None, 2: 25%, 3: 50%, 4: 75%, 5: Full)
5. Additional work: (1: Yes, 2: No)
6. Regular artistic or sports activity: (1: Yes, 2: No)
7. Do you have a partner: (1: Yes, 2: No)
8. Total salary if available (1: USD 135-200, 2: USD 201-270, 3: USD 271-340, 4: USD 341-410, 5: above 410)
9. Transportation to the university: (1: Bus, 2: Private car/taxi, 3: bicycle, 4: Other)
10. Accommodation type in Cyprus: (1: rental, 2: dormitory, 3: with family, 4: Other)
11. Mothers education: (1: primary school, 2: secondary school, 3: high school, 4: university, 5: MSc., 6: Ph.D.)
12. Fathers education: (1: primary school, 2: secondary school, 3: high school, 4: university, 5: MSc., 6: Ph.D.)
13. Number of sisters/brothers (if available): (1: 1, 2: 2, 3: 3, 4: 4, 5: 5 or above)
14. Parental status: (1: married, 2: divorced, 3: died - one of them or both)
15. Mothers occupation: (1: retired, 2: housewife, 3: government officer, 4: private sector employee, 5: self-employment, 6: other)
16. Fathers occupation: (1: retired, 2: government officer, 3: private sector employee, 4: self-employment, 5: other)
17. Weekly study hours: (1: None, 2: <5 hours, 3: 6-10 hours, 4: 11-20 hours, 5: more than 20 hours)
18. Reading frequency (non-scientific books/journals): (1: None, 2: Sometimes, 3: Often)
19. Reading frequency (scientific books/journals): (1: None, 2: Sometimes, 3: Often)
20. Attendance to the seminars/conferences related to the department: (1: Yes, 2: No)
21. Impact of your projects/activities on your success: (1: positive, 2: negative, 3: neutral)
22. Attendance to classes (1: always, 2: sometimes, 3: never)
23. Preparation to midterm exams 1: (1: alone, 2: with friends, 3: not applicable)
24. Preparation to midterm exams 2: (1: closest date to the exam, 2: regularly during the semester, 3: never)
25. Taking notes in classes: (1: never, 2: sometimes, 3: always)
26. Listening in classes: (1: never, 2: sometimes, 3: always)
27. Discussion improves my interest and success in the course: (1: never, 2: sometimes, 3: always)
28. Flip-classroom: (1: not useful, 2: useful, 3: not applicable)
29. Cumulative grade point average in the last semester (/4.00): (1: <2.00, 2: 2.00-2.49, 3: 2.50-2.99, 4: 3.00-3.49, 5: above 3.49)
30. Expected Cumulative grade point average in the graduation (/4.00): (1: <2.00, 2: 2.00-2.49, 3: 2.50-2.99, 4: 3.00-3.49, 5: above 3.49)
31. Course ID
32. OUTPUT Grade (0: Fail, 1: DD, 2: DC, 3: CC, 4: CB, 5: BB, 6: BA, 7: AA)
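
Because every response is stored as an integer code, it can help to translate codes back into their labels when reading results. The sketch below shows one way to do that for the output grade, using the mapping from item 32 above. The `toy` frame and the `GRADE_LABEL` column name are hypothetical stand-ins, not part of the real data set.

```python
import pandas as pd

# Code-to-label mapping copied from item 32 of the attribute list.
GRADE_LABELS = {0: "Fail", 1: "DD", 2: "DC", 3: "CC", 4: "CB", 5: "BB", 6: "BA", 7: "AA"}

# Hypothetical toy frame standing in for the real data set.
toy = pd.DataFrame({"GRADE": [1, 5, 0, 7]})

# Series.map looks up each code in the dictionary; codes that are
# missing from the dictionary would become NaN.
toy["GRADE_LABEL"] = toy["GRADE"].map(GRADE_LABELS)
print(toy)
```

The same pattern should work for any of the coded questions by swapping in the corresponding dictionary.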

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy
from scipy import stats
%matplotlib inline
import math

In [2]:
st = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00623/DATA.csv', delimiter=';')

In [11]:
st.head()

Unnamed: 0,STUDENT ID,1,2,3,4,5,6,7,8,9,...,23,24,25,26,27,28,29,30,COURSE ID,GRADE
0,STUDENT1,2,2,3,3,1,2,2,1,1,...,1,1,3,2,1,2,1,1,1,1
1,STUDENT2,2,2,3,3,1,2,2,1,1,...,1,1,3,2,3,2,2,3,1,1
2,STUDENT3,2,2,2,3,2,2,2,2,4,...,1,1,2,2,1,1,2,2,1,1
3,STUDENT4,1,1,1,3,1,2,1,2,1,...,1,2,3,2,2,1,3,2,1,1
4,STUDENT5,2,2,1,3,2,2,1,3,1,...,2,1,2,2,2,1,2,2,1,1


A quick look at the columns in the data

In [9]:
st.columns

Index(['STUDENT ID', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11',
       '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23',
       '24', '25', '26', '27', '28', '29', '30', 'COURSE ID', 'GRADE'],
      dtype='object')

A quick look at the data type in each column

In [10]:
st.dtypes

STUDENT ID    object
1              int64
2              int64
3              int64
4              int64
5              int64
6              int64
7              int64
8              int64
9              int64
10             int64
11             int64
12             int64
13             int64
14             int64
15             int64
16             int64
17             int64
18             int64
19             int64
20             int64
21             int64
22             int64
23             int64
24             int64
25             int64
26             int64
27             int64
28             int64
29             int64
30             int64
COURSE ID      int64
GRADE          int64
dtype: object
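
The all-`int64` listing above is itself a hint that no responses are missing. As far as I understand, `pd.read_csv` treats the literal string "N/A" as a missing value by default, and a NaN forces an otherwise integer column to `float64`. The sketch below illustrates this on a hypothetical two-row sample written in the same `;`-delimited layout; the sample rows are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same ';'-delimited layout as DATA.csv.
sample = io.StringIO("STUDENT ID;1;2;GRADE\nSTUDENT1;2;N/A;1\nSTUDENT2;1;2;3\n")
toy = pd.read_csv(sample, delimiter=";")

print(toy.dtypes)        # column "2" is float64 because it holds a NaN
print(toy.isna().sum())  # per-column count of missing values
```

If any column in the real data had contained an "N/A", I would expect it to show up as `float64` (or `object`) rather than `int64`.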

### Data Cleaning

For data cleaning I will be doing a few things:
* First, I will rename all of the columns so that each column name states the question it represents rather than a generic number.
* I want to keep as much of this data set available as possible, so my plan is to ignore "N/A" values only in the sample space that I am concerned with for each item. For example, in my hypothesis test of whether having additional work correlates with doing well (or not well) in the class, I would ignore only students with an "N/A" in that sample space.
    * Note - there do not appear to be any "N/A" values in this data, so I expect this to simply be a check that will not actually limit my sample space.
* As a last check, I will validate that the values in the sample space for each hypothesis test are valid. For example, if a student had a value of "3" for "Do you have a partner," I would ignore that student, given that the only valid values are "1" and "2."
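
The per-column cleaning plan above can be sketched as a small helper. This is a minimal illustration under my own assumptions: the function name `clean_column` and the `toy` frame are hypothetical, and the real analysis may structure this differently.

```python
import pandas as pd

def clean_column(df, column, valid_codes):
    """Return only the rows whose value in `column` is present and valid."""
    present = df[column].notna()           # drop "N/A" (NaN) responses
    valid = df[column].isin(valid_codes)   # drop codes outside the allowed set
    return df[present & valid]

# Hypothetical toy data: one invalid code (3) and one missing value.
toy = pd.DataFrame({
    "Do you have a partner": [1, 2, 3, None, 2],
    "GRADE": [4, 1, 2, 5, 7],
})

cleaned = clean_column(toy, "Do you have a partner", valid_codes=[1, 2])
print(len(toy), "->", len(cleaned))
```

Filtering per column, rather than dropping every row that has a problem anywhere, keeps a student in the sample for all the questions they answered validly.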

If you referenced any web sites or solutions not of your own creation, list those references here:

* https://www.geeksforgeeks.org/check-for-nan-in-pandas-dataframe/

Below, I clean the data by renaming the column headers so that the meaning of each column does not have to be looked up repeatedly.

In [13]:
# Map each numeric question identifier to a descriptive column name.
# "COURSE ID" and "GRADE" already have descriptive names, so they are left as-is.
st = st.rename(columns={"1": "Student Age", "2": "Sex", "3": "Graduated high-school type", "4": "Scholarship type",
                  "5": "Additional work", "6": "Regular artistic or sports activity", "7": "Do you have a partner",
                  "8": "Total salary if available", "9": "Transportation to the university", "10": "Accommodation type in Cyprus",
                  "11": "Mothers education", "12": "Fathers education", "13": "Number of sisters/brothers",
                  "14": "Parental status", "15": "Mothers occupation", "16": "Fathers occupation", "17": "Weekly study hours",
                  "18": "Reading frequency (non-scientific)", "19": "Reading frequency (scientific)",
                  "20": "Attendance to the seminars/conferences related to the department",
                  "21": "Impact of your projects/activities on your success", "22": "Attendance to classes",
                  "23": "Preparation to midterm exams 1", "24": "Preparation to midterm exams 2", "25": "Taking notes in class",
                  "26": "Listening in classes",
                  "27": "Discussion improves my interest and success in the course", "28": "Flip-classroom",
                  "29": "Cumulative grade point average in the last semester", "30": "Expected cumulative grade point average in the graduation"})
st

Unnamed: 0,STUDENT ID,Student Age,Sex,Graduated high-school type,Scholarship type,Additional work,Regular artistic or sports activity,Do you have a partner,Total salary if available,Transportation to the university,...,Preparation to midterm exams 1,Preparation to midterm exams 2,Taking notes in class,Listening in classes,Discussion improves my interest and success in the course,Flip-classroom,Cumulative grade point average in the last semester,Expected cumulative grade point average in the graduation,COURSE ID,GRADE
0,STUDENT1,2,2,3,3,1,2,2,1,1,...,1,1,3,2,1,2,1,1,1,1
1,STUDENT2,2,2,3,3,1,2,2,1,1,...,1,1,3,2,3,2,2,3,1,1
2,STUDENT3,2,2,2,3,2,2,2,2,4,...,1,1,2,2,1,1,2,2,1,1
3,STUDENT4,1,1,1,3,1,2,1,2,1,...,1,2,3,2,2,1,3,2,1,1
4,STUDENT5,2,2,1,3,2,2,1,3,1,...,2,1,2,2,2,1,2,2,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
140,STUDENT141,2,1,2,3,1,1,2,1,1,...,1,1,2,1,2,1,3,3,9,5
141,STUDENT142,1,1,2,4,2,2,2,1,4,...,1,1,3,2,2,1,5,3,9,5
142,STUDENT143,1,1,1,4,2,2,2,1,1,...,1,1,3,3,2,1,4,3,9,1
143,STUDENT144,2,1,2,4,1,1,1,5,2,...,2,1,2,1,2,1,5,3,9,4


Below is a check to see if there are any "N/A" or erroneous values in "Additional work"

In [20]:
# Collect the distinct non-missing values that appear in "Additional work"
values = sorted(st["Additional work"].dropna().unique().tolist())

# The only valid values for "Additional work" are 1: Yes and 2: No.
# Comparing as sets keeps the check independent of the order in which
# the values first appear in the column.
AvailableValues = [1, 2]
valuesAreValid = set(values) <= set(AvailableValues)

# Check for any NaN values
nanValues = st["Additional work"].isnull().values.any()

print("Values in 'Additional work'", values)
print("Are the values in 'Additional work' valid?", valuesAreValid)
print("Are there any NaN values?", nanValues)

Values in 'Additional work' [1, 2]
Are the values in 'Additional work' valid? True
Are there any NaN values? False
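
With the column validated, the hypothesis test on additional work mentioned earlier could look roughly like the sketch below. This is only an illustration under my own assumptions: the `toy` responses are made up, the "Did well" column name is hypothetical, and the cutoff of CC (code 3) or better for "doing well" is an assumption rather than something defined by the data set. A chi-square test of independence is one reasonable choice for two categorical variables, not necessarily the test the final analysis will use.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data, NOT the real survey responses.
toy = pd.DataFrame({
    "Additional work": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "GRADE":           [0, 1, 2, 5, 3, 4, 6, 7, 1, 5],
})

# Assumption: "doing well" means a grade of CC (code 3) or better.
toy["Did well"] = toy["GRADE"] >= 3

# Contingency table of additional work (rows) against outcome (columns).
table = pd.crosstab(toy["Additional work"], toy["Did well"])

# Chi-square test of independence between the two categorical variables.
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(table)
print("p-value:", p_value)
```

A small p-value would suggest that the outcome is not independent of having additional work, although with counts as small as in this toy example the result should not be taken seriously.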
