In [400]:
%%capture
from functions import *
from ipynb.fs.full.Student_Information import stud_info
from ipynb.fs.full.Courses import courses

# Assessments
---

## Assessments
---

The assessments dataframe contains information about the unique assessments in each code module and presentation.

In [401]:
# show assessments dataframe
assessments.head()

Unnamed: 0,code_module,code_presentation,id_assessment,assessment_type,date,weight
0,AAA,2013J,1752,TMA,19.0,10.0
1,AAA,2013J,1753,TMA,54.0,20.0
2,AAA,2013J,1754,TMA,117.0,20.0
3,AAA,2013J,1755,TMA,166.0,20.0
4,AAA,2013J,1756,TMA,215.0,30.0


**Size**

In [402]:
# function to get a dataframe of rows and columns
get_size(assessments)

Unnamed: 0,Count
Columns,6
Rows,206


In [403]:
md(f'''
Assessments has {len(assessments.columns)} features describing {len(assessments)} exam records.
''')


Assessments has 6 features describing 206 exam records.


---

### Assessments Contents

* **code_module**: The code module represents the code name of the course the assessment was held for.
* **code_presentation**: The presentation represents the presentation which the test was held for.
* **id_assessment**: The assessment ID is the unique identifier for each assessment.
* **assessment_type**: The assessment type represents the kind of assessment it was.
    - There are three assessment types:
        * TMA: Tutor Marked Assessment
        * CMA: Computer Marked Assessment
        * Exam: The Final Exam
* **date**: The date is how many days from the start of the course the assessment took place
* **weight**: The weight is the weighted value of the assessment. Exams should have a weight of 100 which the rest of the assessments should add to 100 in total.

* column names will be changed to be less verbose
    * code_module to module
    * code_presentation to presentation

In [404]:
assessments_rename = {'code_module':'module', 'code_presentation':'presentation'}
assessments = assessments.rename(columns=assessments_rename)

## Student Assessment

---

The Student Assessments dataframe contains information about each student and the assessments they took during the module

In [405]:
student_assessment.head()

Unnamed: 0,id_assessment,id_student,date_submitted,is_banked,score
0,1752,11391,18,0,78.0
1,1752,28400,22,0,70.0
2,1752,31604,17,0,72.0
3,1752,32885,26,0,69.0
4,1752,38053,19,0,79.0


**Size**

In [406]:
# get the size of student_assessment
get_size(student_assessment)

Unnamed: 0,Count
Columns,5
Rows,173912


In [407]:
# store the size of student_assessment's columns
sa_cols = len(student_assessment.columns)
# store the size of student_assessment's rows
sa_rows = len(student_assessment)
md(f'''
Student Assessment has {sa_cols} columns and {"{:,}".format(sa_rows)} rows, which is how many exams we have data for.
''')


Student Assessment has 5 columns and 173,912 rows, which is how many exams we have data for.


---

### Student Assessment Contents

* **id_assessment**: The assessment ID is the unique identifier for the assessment the student took.
* **id_student**: The student ID is the unique identifier for the student who took the assessment.
* **date_submitted**: The date submitted is the date the student submitted the exam relevant to the start date of the module.
* **is_banked**: Whether the score for the assessment is banked indicates wheter the assessment result was transferred from a previous presentation.
    - is_banked has no relevant information to our analysis and can be removed
* **score**: The score the student received for the assessment. 40 or above is considered a passing score.

* column names will be changed to be less verbose
    * code_module to module
    * code_presentation to presentation

---

### Assessments Information

**Data Types**

In [408]:
# function to get a dataframe of data types
get_dtypes(assessments)

index,Type
module,object
presentation,object
id_assessment,int64
assessment_type,object
date,float64
weight,float64


* `id_assessments` is a categorical value and so should be converted to `string`
* `object` types should be converted to strings
* Both of the `float64` typed variables are whole numbers and should be converted to `int64`

In [409]:
# converting the data types
# id assessment to string
assessments['id_assessment'] = assessments['id_assessment'].astype(str)
# all other data types to those which support pandas <NA>
assessments = assessments.convert_dtypes(convert_integer=False)

**Null Values**

In [410]:
# function to return dataframe of nulls in columns
null_vals(assessments)

index,Null Values
module,0
presentation,0
id_assessment,0
assessment_type,0
date,11
weight,0


In [411]:
md(f'''
* We have {assessments['date'].isnull().sum()} null data points for assessment date. 
* The documentation of this dataset states that if the exam date is missing then it is as the end of the last presentation week. 
* We can find this information in the courses dataframe, and add them in to get rid of the NaNs.
''')


* We have 11 null data points for assessment date. 
* The documentation of this dataset states that if the exam date is missing then it is as the end of the last presentation week. 
* We can find this information in the courses dataframe, and add them in to get rid of the NaNs.


In [412]:
# adding the dates for the null test dates
for index, row in assessments[assessments['date'].isna()].iterrows():
    assessments.at[index, 'date'] = courses.loc[(courses['module'] == row['module']) & (courses['presentation'] == row['presentation']), 'module_presentation_length']

# reprinting to ensure it worked
dataframe(assessments.isnull().sum(), columns=['Null Values'])

Unnamed: 0,Null Values
module,0
presentation,0
id_assessment,0
assessment_type,0
date,0
weight,0


**Unique Counts**

In [413]:
# function to get unique value counts
count_unique(assessments)

index,Count
module,7
presentation,4
id_assessment,206
assessment_type,3
date,78
weight,24


In [414]:
md(f'''
There are {assessments['id_assessment'].nunique()} unique assessment ID's
''')


There are 206 unique assessment ID's


**Unique Categorical Values**

In [415]:
# function to get unique categorical values in columns
unique_vals(assessments)

index,Values
module,"['AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'FFF', 'GGG']"
presentation,"['2013J', '2014J', '2013B', '2014B']"
id_assessment,"['1752', '1753', '1754', '1755', '1756', '1757', '1758', '1759', '1760', '1761', '1762', '1763', '14991', '14992', '14993', '14994', '14995', '14984', '14985', '14986', '14987', '14988', '14989', '14990', '15003', '15004', '15005', '15006', '15007', '14996', '14997', '14998', '14999', '15000', '15001', '15002', '15015', '15016', '15017', '15018', '15019', '15008', '15009', '15010', '15011', '15012', '15013', '15014', '15020', '15021', '15022', '15023', '15024', '15025', '24286', '24287', '24288', '24289', '24282', '24283', '24284', '24285', '24290', '40087', '24295', '24296', '24297', '24298', '24291', '24292', '24293', '24294', '24299', '40088', '25341', '25342', '25343', '25344', '25345', '25346', '25347', '25334', '25335', '25336', '25337', '25338', '25339', '25340', '25348', '25349', '25350', '25351', '25352', '25353', '25354', '25355', '25356', '25357', '25358', '25359', '25360', '25361', '25362', '25363', '25364', '25365', '25366', '25367', '25368', '30709', '30710', '30711', '30712', '30713', '30714', '30715', '30716', '30717', '30718', '30719', '30720', '30721', '30722', '30723', '34865', '34866', '34867', '34868', '34869', '34871', '34870', '34860', '34861', '34862', '34863', '34864', '34872', '34878', '34879', '34880', '34881', '34882', '34884', '34883', '34873', '34874', '34875', '34876', '34877', '34885', '34891', '34892', '34893', '34894', '34895', '34897', '34896', '34886', '34887', '34888', '34889', '34890', '34898', '34904', '34905', '34906', '34907', '34908', '34910', '34909', '34899', '34900', '34901', '34902', '34903', '34911', '37418', '37419', '37420', '37421', '37422', '37423', '37415', '37416', '37417', '37424', '37428', '37429', '37430', '37431', '37432', '37433', '37425', '37426', '37427', '37434', '37438', '37439', '37440', '37441', '37442', '37443', '37435', '37436', '37437', '37444']"
assessment_type,"['TMA', 'Exam', 'CMA']"


Everything here is as we would expect in the data's description

In [416]:
# get the value counts for each type of exam
dataframe(assessments['assessment_type'].value_counts())

Unnamed: 0,assessment_type
TMA,106
CMA,76
Exam,24


In [417]:
TMA_count = assessments['assessment_type'].value_counts()[0]
CMA_count = assessments['assessment_type'].value_counts()[1]
exam_count = assessments['assessment_type'].value_counts()[2]
md(f'''
There are {TMA_count} Tutor Marked Assessements, {CMA_count} Computer Marked Assessments and {exam_count} Final Exams in our data.
''')


There are 106 Tutor Marked Assessements, 76 Computer Marked Assessments and 24 Final Exams in our data.


In [418]:
md('''
Our data source tells us that final exams are weighted 100 and the weights of the rest of the exams in a module should amount to 100.
 So each module should have a total weight of 200.
''')


Our data source tells us that final exams are weighted 100 and the weights of the rest of the exams in a module should amount to 100.
 So each module should have a total weight of 200.


In [419]:
pd.pivot_table(assessments, index=['presentation', 'module'], values='weight', aggfunc=np.sum)

Unnamed: 0_level_0,Unnamed: 1_level_0,weight
presentation,module,Unnamed: 2_level_1
2013B,BBB,200.0
2013B,DDD,200.0
2013B,FFF,200.0
2013J,AAA,200.0
2013J,BBB,200.0
2013J,DDD,200.0
2013J,EEE,200.0
2013J,FFF,200.0
2013J,GGG,100.0
2014B,BBB,200.0


This pivot table shows the module presentation, the module and the total of the weights of the exams in the module.
Because the final exam is weighted 100 and the other exams should form another 100 we should have 200 points in each module. We see here that CCC modules 300 in total weight and GGG modules have 100 in total weight.

In [420]:
ccc_ggg_weights = assessments.loc[(assessments['module'] == 'CCC') | (assessments['module'] == 'GGG')]
ccc_ggg_weights = ccc_ggg_weights[['module', 'id_assessment', 'assessment_type', 'weight']]
ccc_ggg_weights

Unnamed: 0,module,id_assessment,assessment_type,weight
54,CCC,24286,CMA,2.0
55,CCC,24287,CMA,7.0
56,CCC,24288,CMA,8.0
57,CCC,24289,CMA,8.0
58,CCC,24282,TMA,9.0
59,CCC,24283,TMA,22.0
60,CCC,24284,TMA,22.0
61,CCC,24285,TMA,22.0
62,CCC,24290,Exam,100.0
63,CCC,40087,Exam,100.0


We can see here that CCC Modules had two final exams, and the GGG module's full course weight consisted only of a final exam. 

The data source mentions some final exams may be missing scores, so instead of dealing with this problem here, we will come back to this problem after we have explored student_assessment in case the missing final exam scores prove problematic.

**Duplicate Values:**

In [421]:
get_dupes(assessments)

There are no Duplicate Values

**Numerical Values**

In [422]:
assessments.describe().round(2)

Unnamed: 0,date,weight
count,206.0,206.0
mean,150.97,20.87
std,78.16,30.38
min,12.0,0.0
25%,81.25,0.0
50%,159.0,12.5
75%,227.0,24.25
max,269.0,100.0


In [423]:
mean_weight = round(assessments['weight'].mean(), 2)
md(f'''
The average test is weighted at {mean_weight}, which makes sense as there is normally 5 tests adding to 100 and then a final which likely brings up that mean.
''')


The average test is weighted at 20.87, which makes sense as there is normally 5 tests adding to 100 and then a final which likely brings up that mean.


---

### Student Assessment Information

**Data Types**

In [424]:
# get student_assessment column datatypes
get_dtypes(student_assessment)

index,Type
id_assessment,int64
id_student,int64
date_submitted,int64
is_banked,int64
score,float64


* `id_student` and `id_assessments` are both categorical values and so should be converted from `int64` to `string`

In [425]:
# converting the data types
student_assessment = student_assessment.astype({'id_assessment': str, 'id_student': str})
# change student_assessment datatypes to values pandas supports better
student_assessment = student_assessment.convert_dtypes()
dataframe(student_assessment.dtypes, columns=['Data Type'])

Unnamed: 0,Data Type
id_assessment,string
id_student,string
date_submitted,Int64
is_banked,Int64
score,Int64


**Duplicate Values**

In [426]:
# gives a dataframe of duplicate values if any
get_dupes(student_assessment)

There are no Duplicate Values

**Unique Value Counts**

In [427]:
# gives a dataframe of counts of unique values per column
count_unique(student_assessment)

index,Count
id_assessment,188
id_student,23369
date_submitted,312
is_banked,2
score,101


In [428]:
assmnt_count = student_assessment['id_assessment'].nunique()
total_assmnts = assessments['id_assessment'].nunique()
assmnt_unique_students = student_vle['id_student'].nunique()
si_unique_students = stud_info['id_student'].nunique()
md(f'''
* There are {assmnt_count} unique assessments that students took.
* This is less than the {total_assmnts} assessments we observed in the assessments dataframe meaning that there are some assessments on record that students did not take.
* There are {"{:,}".format(assmnt_unique_students)} in student_assessment out of the {"{:,}".format(si_unique_students)} 
students we have in student info. So {"{:,}".format(si_unique_students - assmnt_unique_students)} students from student info do not have assessment data.
''')


* There are 188 unique assessments that students took.
* This is less than the 206 assessments we observed in the assessments dataframe meaning that there are some assessments on record that students did not take.
* There are 26,074 in student_assessment out of the 28,785 
students we have in student info. So 2,711 students from student info do not have assessment data.


**Numerical Values**

In [429]:
student_assessment.describe().round(1)

Unnamed: 0,date_submitted,is_banked,score
count,173912.0,173912.0,173739.0
mean,116.0,0.0,75.8
std,71.5,0.1,18.8
min,-11.0,0.0,0.0
25%,51.0,0.0,65.0
50%,116.0,0.0,80.0
75%,173.0,0.0,90.0
max,608.0,1.0,100.0


In [430]:
mean_score = student_assessment['score'].mean().round(1)
date_max = student_assessment['date_submitted'].max()
date_min = student_assessment['date_submitted'].min()
max_course_length = courses['module_presentation_length'].max()
md(f'''
* The average test score is {mean_score} so most students are passing handily if 40 is considered a failing score.
* The minimum date_submitted is {date_min} so it is possible the students had access to the first exam early, or that there is errors in the data.
 * They don't fall too far outside the range, so we will keep those as they are
* The maximum date submitted is {date_max}, which is around 2.5 times longer than any course went on for.
* Let's check for records that are over the maximum course length of {max_course_length} days.
''')


* The average test score is 75.8 so most students are passing handily if 40 is considered a failing score.
* The minimum date_submitted is -11 so it is possible the students had access to the first exam early, or that there is errors in the data.
 * They don't fall too far outside the range, so we will keep those as they are
* The maximum date submitted is 608, which is around 2.5 times longer than any course went on for.
* Let's check for records that are over the maximum course length of 269 days.


In [431]:
late_tests = student_assessment.loc[student_assessment['date_submitted'] > 269].sort_values(by='date_submitted').reset_index(drop=True)
late_tests.head()

Unnamed: 0,id_assessment,id_student,date_submitted,is_banked,score
0,15022,1723749,270,0,
1,30722,691701,274,0,
2,25368,2341830,279,0,49.0
3,24299,555498,285,0,58.0
4,34879,595935,287,0,96.0


In [432]:
late_test_count = len(late_tests['date_submitted'])
late_test_min = late_tests['date_submitted'].min()
late_test_max = late_tests['date_submitted'].max()
late_test_avg = late_tests['date_submitted'].mean().round(1)
md(f'''
* There are {late_test_count} records of students handing in their exams well after the end of the module.
* These dates range from {late_test_min} days after the course began and {late_test_max} days after the course began with an average of {late_test_avg} days.
* The data source makes no mention of these, and they should not affect our analysis, so though strange, we will leave these records
''')


* There are 73 records of students handing in their exams well after the end of the module.
* These dates range from 270 days after the course began and 608 days after the course began with an average of 470.2 days.
* The data source makes no mention of these, and they should not affect our analysis, so though strange, we will leave these records


**Null Values**

In [433]:
# get null values if any
null_vals(student_assessment)

index,Null Values
id_assessment,0
id_student,0
date_submitted,0
is_banked,0
score,173


In [434]:
null_score = student_assessment['score'].isnull().sum()
md(f'''
* We have {null_score} null values for score, which is important as it is a value we will be trying to predict.
* These may be the missing final exam scores. To know that we will have to merge the assessments and student_assessment dataframes so
we have the score, assessment ID and module/presentation in one place.
''')


* We have 173 null values for score, which is important as it is a value we will be trying to predict.
* These may be the missing final exam scores. To know that we will have to merge the assessments and student_assessment dataframes so
we have the score, assessment ID and module/presentation in one place.


**Merged Asssessments**

Here we will merge the assessments and student assessments dataframes in order to combine our student scores and submission dates with assessment type, date of the assessment, and weight of the assessment.

In [435]:
# merges dataframes student_assessment with assessments with a full outer join on their common ID id_assessment
# creates a column _merge which tells you if the id_assessment was found in one or both dataframes
merged_assessments = student_assessment.merge(assessments, how='outer', on=['id_assessment'] ,indicator=True)
merged_assessments = merged_assessments.astype({'id_student': str, 'id_assessment':str}).convert_dtypes()
merged_assessments.head()

Unnamed: 0,id_assessment,id_student,date_submitted,is_banked,score,module,presentation,assessment_type,date,weight,_merge
0,1752,11391,18,0,78,AAA,2013J,TMA,19.0,10.0,both
1,1752,28400,22,0,70,AAA,2013J,TMA,19.0,10.0,both
2,1752,31604,17,0,72,AAA,2013J,TMA,19.0,10.0,both
3,1752,32885,26,0,69,AAA,2013J,TMA,19.0,10.0,both
4,1752,38053,19,0,79,AAA,2013J,TMA,19.0,10.0,both


Our new `_merge` column tells us if the data maps perfectly, or if it is only found on the right or left side, the right side being the assessments dataframe and the left side being the student_assessments dataframe. 

In [436]:
# make a dataframe of students with a score of 0 and display it
na_scores = merged_assessments.loc[merged_assessments['score'].isna(), ['presentation', 'module', 'weight', 'id_assessment']]
dataframe(na_scores.value_counts(subset=['id_assessment', 'weight']), columns=['NA Exam Count'])

Unnamed: 0_level_0,Unnamed: 1_level_0,NA Exam Count
id_assessment,weight,Unnamed: 2_level_1
25339,15.0,7
15013,18.0,7
34860,12.5,6
14998,18.0,5
25363,10.0,5
...,...,...
30714,16.0,1
30715,28.0,1
30718,100.0,1
30719,16.0,1


In [437]:
non_100_na = na_scores.loc[na_scores['weight'] != 100]
weight_100_na = na_scores.loc[na_scores['weight'] == 100]
md(f'''
Above we can see a dataframe of the exams which have NA scores and the count of how many NA scores they have. 
* {na_scores['id_assessment'].nunique()} out of {len(assessments)} assessments have at least one NA score.
* {weight_100_na['id_assessment'].nunique()} exams  with NA scores are final exams
* {non_100_na['id_assessment'].nunique()} exams with NA scores are not final exams
''')


Above we can see a dataframe of the exams which have NA scores and the count of how many NA scores they have. 
* 96 out of 206 assessments have at least one NA score.
* 18 exams  with NA scores are final exams
* 78 exams with NA scores are not final exams


In [438]:
# make a dataframe of students with a score of 0 and display it
zero_scores = student_assessment.loc[student_assessment['score'] == 0]
zero_scores.head()

Unnamed: 0,id_assessment,id_student,date_submitted,is_banked,score
785,1754,2456480,123,0,0
4322,14984,554986,24,0,0
4730,14985,141823,46,0,0
5391,14985,542259,46,0,0
5509,14985,549078,48,0,0


In [439]:
md(f'''
The first thing to check would be whether there are students with a 0 for a score to see if the NaNs represent 0's.
Above is a dataframe of length {len(zero_scores)} for records of assessments with a 0 score, 
so the NaNs are not necessarily 0's.
''')


The first thing to check would be whether there are students with a 0 for a score to see if the NaNs represent 0's.
Above is a dataframe of length 329 for records of assessments with a 0 score, 
so the NaNs are not necessarily 0's.


In [440]:
md(f'''
**Rows that do not map**

First we will deal with the missing final exam scores. We have found that {weight_100_na['id_assessment'].nunique()} exams have students with
NA scores, but our `_merge` column suggests that there are some exams only found in assessments. Let's look at those missing exams''')


**Rows that do not map**

First we will deal with the missing final exam scores. We have found that 18 exams have students with
NA scores, but our `_merge` column suggests that there are some exams only found in assessments. Let's look at those missing exams

In [441]:
missing_exams = merged_assessments.loc[merged_assessments['_merge'] != 'both'].reset_index(drop=True)
missing_exams

Unnamed: 0,id_assessment,id_student,date_submitted,is_banked,score,module,presentation,assessment_type,date,weight,_merge
0,1757,,,,,AAA,2013J,Exam,268.0,100.0,right_only
1,1763,,,,,AAA,2014J,Exam,269.0,100.0,right_only
2,14990,,,,,BBB,2013B,Exam,240.0,100.0,right_only
3,15002,,,,,BBB,2013J,Exam,268.0,100.0,right_only
4,15014,,,,,BBB,2014B,Exam,234.0,100.0,right_only
5,15025,,,,,BBB,2014J,Exam,262.0,100.0,right_only
6,40087,,,,,CCC,2014B,Exam,241.0,100.0,right_only
7,40088,,,,,CCC,2014J,Exam,269.0,100.0,right_only
8,30713,,,,,EEE,2013J,Exam,235.0,100.0,right_only
9,30718,,,,,EEE,2014B,Exam,228.0,100.0,right_only


These are the final exams completely missing from the dataset. It is possible that students have other exams that are missing but not from all students, which would not show in our data.

In [442]:
assessments.loc[assessments['assessment_type'] == 'Exam']

Unnamed: 0,module,presentation,id_assessment,assessment_type,date,weight
5,AAA,2013J,1757,Exam,268.0,100.0
11,AAA,2014J,1763,Exam,269.0,100.0
23,BBB,2013B,14990,Exam,240.0,100.0
35,BBB,2013J,15002,Exam,268.0,100.0
47,BBB,2014B,15014,Exam,234.0,100.0
53,BBB,2014J,15025,Exam,262.0,100.0
62,CCC,2014B,24290,Exam,241.0,100.0
63,CCC,2014B,40087,Exam,241.0,100.0
72,CCC,2014J,24299,Exam,269.0,100.0
73,CCC,2014J,40088,Exam,269.0,100.0


In [443]:
md(f'''
These {len(missing_exams)} rows all have entries in the assessments dataframe but have no match in the student_assessment dataframe. 
This indicates that no students in our data took these exams. These are all weight 100 and will be the missing final exams.
''')


These 18 rows all have entries in the assessments dataframe but have no match in the student_assessment dataframe. 
This indicates that no students in our data took these exams. These are all weight 100 and will be the missing final exams.


In [444]:
# locate the module and presentation for rows in merged_assessments where the weight is 100 and the score is not null
finals_w_scores = merged_assessments.loc[(merged_assessments['weight'] == 100) & (merged_assessments['score'].notna()), ['weight', 'module', 'presentation', 'id_assessment']].drop_duplicates()
finals_w_scores

Unnamed: 0,weight,module,presentation,id_assessment
52923,100.0,CCC,2014B,24290
63953,100.0,CCC,2014J,24299
69640,100.0,DDD,2013B,25340
82462,100.0,DDD,2013J,25354
87448,100.0,DDD,2014B,25361
95035,100.0,DDD,2014J,25368


In [445]:
ttl_mod_pres = (courses['module'] + courses['presentation']).nunique()
ttl_final_exams = assessments.loc[assessments['assessment_type'] == 'Exam']
mod_pres_w_scores = len(finals_w_scores)
md(f'''Above is a dataframe of all of the modules and presentations we have a final exam for. Only {mod_pres_w_scores} modules out of 
{ttl_mod_pres} have final exam scores for any students.

Since we are missing final exam data for so many students, we cannot use this to make a weighted average for students. It is particularly troublesome
because we have no final exams in GGG whose whole score is based on the final exams since the other exams in GGG are weight 0.

What we will do is base the weighted average on exams that are not the final exam and assign GGG's exams the rounded average of the weights of exams in
modules that have the same number of exams.''')

Above is a dataframe of all of the modules and presentations we have a final exam for. Only 6 modules out of 
22 have final exam scores for any students.

Since we are missing final exam data for so many students, we cannot use this to make a weighted average for students. It is particularly troublesome
because we have no final exams in GGG whose whole score is based on the final exams since the other exams in GGG are weight 0.

What we will do is base the weighted average on exams that are not the final exam and assign GGG's exams the rounded average of the weights of exams in
modules that have the same number of exams.

In [446]:
ggg_assessments = assessments.loc[(assessments['module'] == 'GGG') & (assessments['presentation'] == '2014J'), ['presentation', 'id_assessment', 'weight']]
modules = assessments['module'].unique()
presentations = assessments['presentation'].unique()
for i in modules:
    for j in presentations:
        if len(assessments.loc[(assessments['module'] == i) & (assessments['presentation'] == '2014J'), ['presentation', 'id_assessment', 'weight']])  == len(ggg_assessments):
            print(i, j)

CCC 2013J
CCC 2014J
CCC 2013B
CCC 2014B
GGG 2013J
GGG 2014J
GGG 2013B
GGG 2014B


In [447]:
merged_assessments.loc[merged_assessments['id_student'] == '65002', 'id_assessment'].values

<StringArray>
['1752', '1753', '1758', '1759']
Length: 4, dtype: string

In [448]:
merged_assessments['module_presentation'] = merged_assessments['module'] + merged_assessments['presentation']
assessments['module_presentation'] = assessments['module'] + assessments['presentation']

In [500]:
merged_assessments

Unnamed: 0_level_0,id_assessment,id_student,date_submitted,is_banked,score,module,presentation,assessment_type,date,weight,_merge,module_presentation
unique_student_entry,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
11391AAA2013J,1752,11391,18,0,78,AAA,2013J,TMA,19.0,10.0,both,AAA2013J
28400AAA2013J,1752,28400,22,0,70,AAA,2013J,TMA,19.0,10.0,both,AAA2013J
31604AAA2013J,1752,31604,17,0,72,AAA,2013J,TMA,19.0,10.0,both,AAA2013J
32885AAA2013J,1752,32885,26,0,69,AAA,2013J,TMA,19.0,10.0,both,AAA2013J
38053AAA2013J,1752,38053,19,0,79,AAA,2013J,TMA,19.0,10.0,both,AAA2013J
...,...,...,...,...,...,...,...,...,...,...,...,...
<NA>FFF2014B,34898,,,,,FFF,2014B,Exam,227.0,100.0,right_only,FFF2014B
<NA>FFF2014J,34911,,,,,FFF,2014J,Exam,241.0,100.0,right_only,FFF2014J
<NA>GGG2013J,37424,,,,,GGG,2013J,Exam,229.0,100.0,right_only,GGG2013J
<NA>GGG2014B,37434,,,,,GGG,2014B,Exam,222.0,100.0,right_only,GGG2014B


In [452]:
assessments['module_presentation'] = assessments['module'] + assessments['presentation']
merged_assessments['module_presentation'] = merged_assessments['module'] + merged_assessments['presentation']

In [482]:
merged_assessments['unique_student_entry'] = merged_assessments['id_student'] + merged_assessments['module_presentation']
merged_assessments = merged_assessments.set_index('unique_student_entry')

In [498]:

def get_all_assessments():
    # initiate an empty dataframe to store students with all exams
    merged_assessments_plus = []
    temp_df = dataframe()
    

    # iterate through rows of merged_assessments 
    for i, r in merged_assessments.iterrows():
        # initiate an empty dataframe to temporarily store one student at a time
        student_assessments = merged_assessments.index[i]
        module_assessments = assessments.loc[assessments['module_presentation'] == r['module_presentation']]
        temp_df = student_assessments.merge(module_assessments, how='left', on=['id_assessment', 'module_presentation'])
        temp_df['id_student'] = r['id_student']
        merged_assessments_plus.append(temp_df.to_dict())
        print(len(merged_assessments_plus), end = "\r")

In [499]:
%lprun -f get_all_assessments get_all_assessments()


IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

In [466]:
dataframe.from_records(merged_assessments_plus)


Unnamed: 0,id_student,module_presentation,id_assessment,date_submitted,score,module,presentation,assessment_type,date,weight
0,"{0: '11391', 1: '11391', 2: '11391', 3: '11391...","{0: 'AAA2013J', 1: 'AAA2013J', 2: 'AAA2013J', ...","{0: '1752', 1: '1753', 2: '1754', 3: '1755', 4...","{0: 18, 1: 53, 2: 115, 3: 164, 4: 212, 5: <NA>}","{0: 78, 1: 85, 2: 80, 3: 85, 4: 82, 5: <NA>}","{0: 'AAA', 1: 'AAA', 2: 'AAA', 3: 'AAA', 4: 'A...","{0: '2013J', 1: '2013J', 2: '2013J', 3: '2013J...","{0: 'TMA', 1: 'TMA', 2: 'TMA', 3: 'TMA', 4: 'T...","{0: 19.0, 1: 54.0, 2: 117.0, 3: 166.0, 4: 215....","{0: 10.0, 1: 20.0, 2: 20.0, 3: 20.0, 4: 30.0, ..."
1,"{0: '28400', 1: '28400', 2: '28400', 3: '28400...","{0: 'AAA2013J', 1: 'AAA2013J', 2: 'AAA2013J', ...","{0: '1752', 1: '1753', 2: '1754', 3: '1755', 4...","{0: 22, 1: 52, 2: 121, 3: 164, 4: 212, 5: <NA>}","{0: 70, 1: 68, 2: 70, 3: 64, 4: 60, 5: <NA>}","{0: 'AAA', 1: 'AAA', 2: 'AAA', 3: 'AAA', 4: 'A...","{0: '2013J', 1: '2013J', 2: '2013J', 3: '2013J...","{0: 'TMA', 1: 'TMA', 2: 'TMA', 3: 'TMA', 4: 'T...","{0: 19.0, 1: 54.0, 2: 117.0, 3: 166.0, 4: 215....","{0: 10.0, 1: 20.0, 2: 20.0, 3: 20.0, 4: 30.0, ..."
2,"{0: '31604', 1: '31604', 2: '31604', 3: '31604...","{0: 'AAA2013J', 1: 'AAA2013J', 2: 'AAA2013J', ...","{0: '1752', 1: '1753', 2: '1754', 3: '1755', 4...","{0: 17, 1: 51, 2: 115, 3: 165, 4: 213, 5: <NA>}","{0: 72, 1: 71, 2: 74, 3: 88, 4: 75, 5: <NA>}","{0: 'AAA', 1: 'AAA', 2: 'AAA', 3: 'AAA', 4: 'A...","{0: '2013J', 1: '2013J', 2: '2013J', 3: '2013J...","{0: 'TMA', 1: 'TMA', 2: 'TMA', 3: 'TMA', 4: 'T...","{0: 19.0, 1: 54.0, 2: 117.0, 3: 166.0, 4: 215....","{0: 10.0, 1: 20.0, 2: 20.0, 3: 20.0, 4: 30.0, ..."
3,"{0: '32885', 1: '32885', 2: '32885', 3: '32885...","{0: 'AAA2013J', 1: 'AAA2013J', 2: 'AAA2013J', ...","{0: '1752', 1: '1753', 2: '1754', 3: '1755', 4...","{0: 26, 1: 75, 2: 124, 3: 181, 4: 222, 5: <NA>}","{0: 69, 1: 30, 2: 63, 3: 35, 4: 75, 5: <NA>}","{0: 'AAA', 1: 'AAA', 2: 'AAA', 3: 'AAA', 4: 'A...","{0: '2013J', 1: '2013J', 2: '2013J', 3: '2013J...","{0: 'TMA', 1: 'TMA', 2: 'TMA', 3: 'TMA', 4: 'T...","{0: 19.0, 1: 54.0, 2: 117.0, 3: 166.0, 4: 215....","{0: 10.0, 1: 20.0, 2: 20.0, 3: 20.0, 4: 30.0, ..."
4,"{0: '38053', 1: '38053', 2: '38053', 3: '38053...","{0: 'AAA2013J', 1: 'AAA2013J', 2: 'AAA2013J', ...","{0: '1752', 1: '1753', 2: '1754', 3: '1755', 4...","{0: 19, 1: 64, 2: 117, 3: 166, 4: 215, 5: <NA>}","{0: 79, 1: 69, 2: 74, 3: 50, 4: 68, 5: <NA>}","{0: 'AAA', 1: 'AAA', 2: 'AAA', 3: 'AAA', 4: 'A...","{0: '2013J', 1: '2013J', 2: '2013J', 3: '2013J...","{0: 'TMA', 1: 'TMA', 2: 'TMA', 3: 'TMA', 4: 'T...","{0: 19.0, 1: 54.0, 2: 117.0, 3: 166.0, 4: 215....","{0: 10.0, 1: 20.0, 2: 20.0, 3: 20.0, 4: 30.0, ..."
...,...,...,...,...,...,...,...,...,...,...
4326,"{0: '2698257', 1: '2698257', 2: '2698257', 3: ...","{0: 'AAA2014J', 1: 'AAA2014J', 2: 'AAA2014J', ...","{0: '1758', 1: '1759', 2: '1760', 3: '1761', 4...","{0: 19, 1: 54, 2: 117, 3: 166, 4: 215}","{0: 74, 1: 73, 2: 69, 3: 78, 4: 69}","{0: <NA>, 1: <NA>, 2: <NA>, 3: <NA>, 4: <NA>}","{0: <NA>, 1: <NA>, 2: <NA>, 3: <NA>, 4: <NA>}","{0: <NA>, 1: <NA>, 2: <NA>, 3: <NA>, 4: <NA>}","{0: <NA>, 1: <NA>, 2: <NA>, 3: <NA>, 4: <NA>}","{0: <NA>, 1: <NA>, 2: <NA>, 3: <NA>, 4: <NA>}"
4327,"{0: '704156', 1: '704156', 2: '704156', 3: '70...","{0: 'AAA2014J', 1: 'AAA2014J', 2: 'AAA2014J', ...","{0: '1758', 1: '1759', 2: '1760', 3: '1761', 4...","{0: 19, 1: 54, 2: 117, 3: 166, 4: 215}","{0: 74, 1: 73, 2: 69, 3: 78, 4: 69}","{0: <NA>, 1: <NA>, 2: <NA>, 3: <NA>, 4: <NA>}","{0: <NA>, 1: <NA>, 2: <NA>, 3: <NA>, 4: <NA>}","{0: <NA>, 1: <NA>, 2: <NA>, 3: <NA>, 4: <NA>}","{0: <NA>, 1: <NA>, 2: <NA>, 3: <NA>, 4: <NA>}","{0: <NA>, 1: <NA>, 2: <NA>, 3: <NA>, 4: <NA>}"
4328,"{0: '705379', 1: '705379', 2: '705379', 3: '70...","{0: 'AAA2014J', 1: 'AAA2014J', 2: 'AAA2014J', ...","{0: '1758', 1: '1759', 2: '1760', 3: '1761', 4...","{0: 19, 1: 54, 2: 117, 3: 166, 4: 215}","{0: 74, 1: 73, 2: 69, 3: 78, 4: 69}","{0: <NA>, 1: <NA>, 2: <NA>, 3: <NA>, 4: <NA>}","{0: <NA>, 1: <NA>, 2: <NA>, 3: <NA>, 4: <NA>}","{0: <NA>, 1: <NA>, 2: <NA>, 3: <NA>, 4: <NA>}","{0: <NA>, 1: <NA>, 2: <NA>, 3: <NA>, 4: <NA>}","{0: <NA>, 1: <NA>, 2: <NA>, 3: <NA>, 4: <NA>}"
4329,"{0: '749412', 1: '749412', 2: '749412', 3: '74...","{0: 'AAA2014J', 1: 'AAA2014J', 2: 'AAA2014J', ...","{0: '1758', 1: '1759', 2: '1760', 3: '1761', 4...","{0: 19, 1: 54, 2: 117, 3: 166, 4: 215}","{0: 74, 1: 73, 2: 69, 3: 78, 4: 69}","{0: <NA>, 1: <NA>, 2: <NA>, 3: <NA>, 4: <NA>}","{0: <NA>, 1: <NA>, 2: <NA>, 3: <NA>, 4: <NA>}","{0: <NA>, 1: <NA>, 2: <NA>, 3: <NA>, 4: <NA>}","{0: <NA>, 1: <NA>, 2: <NA>, 3: <NA>, 4: <NA>}","{0: <NA>, 1: <NA>, 2: <NA>, 3: <NA>, 4: <NA>}"


In [326]:
def get_all_assessments():
    # initiate an empty dataframe to store students with all exams
    merged_assessments_plus = []

    # iterate through rows of merged_assessments 
    for i, r in merged_assessments.iterrows():
        # initiate an empty dataframe to temporarily store one student at a time
        temp_df = dataframe()
        student_assessments = merged_assessments[merged_assessments['id_student'] == r['id_student']]
        student_assessments = student_assessments[student_assessments['module_presentation'] == r['module_presentation']]
        module_assessments = assessments.loc[assessments['module_presentation'] == r['module_presentation']]
        temp_df = student_assessments.merge(module_assessments, how='outer', on=['id_assessment', 'module_presentation'])
        temp_df['id_student'] = r['id_student']
        temp_df = temp_df.to_dict()
        merged_assessments_plus.append(temp_df)
        print(len(merged_assessments_plus), end = "\r")

In [329]:
get_all_assessments()

10801

KeyboardInterrupt: 

In [327]:
%lprun -f get_all_assessments get_all_assessments()


*** KeyboardInterrupt exception caught in code being profiled.

Timer unit: 1e-07 s

Total time: 12.2085 s
File: <ipython-input-326-e4ffeafaf0ac>
Function: get_all_assessments at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def get_all_assessments():
     2                                               # initiate an empty dataframe to store students with all exams
     3         1         29.0     29.0      0.0      merged_assessments_plus = []
     4                                           
     5                                               # iterate through rows of merged_assessments 
     6       195    1008841.0   5173.5      0.8      for i, r in merged_assessments.iterrows():
     7                                                   # initiate an empty dataframe to temporarily store one student at a time
     8       195    1503179.0   7708.6      1.2          temp_df = dataframe()
     9       195   55570819.0 284978.6     45.5          student_assessments = merged_

In [322]:
%lprun -f get_all_assessments get_all_assessments()


147*** KeyboardInterrupt exception caught in code being profiled.

Timer unit: 1e-07 s

Total time: 8.3492 s
File: <ipython-input-321-2a2b3ff27334>
Function: get_all_assessments at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def get_all_assessments():
     2                                               # initiate an empty dataframe to store students with all exams
     3         1         28.0     28.0      0.0      merged_assessments_plus = []
     4                                           
     5                                               # iterate through rows of merged_assessments 
     6       147     872606.0   5936.1      1.0      for i, r in merged_assessments.iterrows():
     7                                                   # initiate an empty dataframe to temporarily store one student at a time
     8       147    1127364.0   7669.1      1.4          temp_df = dataframe()
     9       147   44063615.0 299752.5     52.8          student_assessments = datafram

In [305]:
get_all_assessments()

1669

KeyboardInterrupt: 

In [314]:
%lprun -f get_all_assessments get_all_assessments()


*** KeyboardInterrupt exception caught in code being profiled.

Timer unit: 1e-07 s

Total time: 6.15705 s
File: <ipython-input-312-5c80025da8c1>
Function: get_all_assessments at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def get_all_assessments():
     2                                               # initiate an empty dataframe to store students with all exams
     3         1         31.0     31.0      0.0      merged_assessments_plus = []
     4                                           
     5                                               # iterate through rows of merged_assessments 
     6        77     642839.0   8348.6      1.0      for i, r in merged_assessments.iterrows():
     7                                                   # initiate an empty dataframe to temporarily store one student at a time
     8        77     583669.0   7580.1      0.9              temp_df = dataframe()
     9        77   43212188.0 561197.2     70.2              student_assessments =

In [294]:
# initiate an empty dataframe to store students with all exams
merged_assessments_plus = []

# iterate through rows of merged_assessments 
for i, r in merged_assessments.iterrows():
    # initiate an empty dataframe to temporarily store one student at a time
    temp_df = dataframe()
    student_assessments = dataframe(merged_assessments.loc[(merged_assessments['id_student'] == r['id_student']) & (merged_assessments['module_presentation'] == r['module_presentation']), ['id_student', 'module_presentation', 'id_assessment', 'date_submitted', 'score']])
    module_assessments = assessments.loc[assessments['module_presentation'] == r['module_presentation']]
    temp_df = student_assessments.merge(module_assessments, how='outer', on=['id_assessment', 'module_presentation'])
    temp_df['id_student'] = r['id_student']
    temp_df = temp_df.to_dict()
    merged_assessments_plus.append(temp_df)
    print(len(merged_assessments_plus), end = "\r")

996

KeyboardInterrupt: 

In [None]:
merged_assessments_plus

In [None]:
# remove tests that students did not take
merged_assessments = merged_assessments.dropna(subset=['id_student'])
# drop the merge column since it is no longer of use
# reset the index to be consecutive again
merged_assessments = merged_assessments.drop(columns=['_merge']).reset_index(drop=True)
# order the columns
merged_assessments = merged_assessments[['module', 'presentation', 'id_student', 'id_assessment', 'assessment_type', 'date_submitted', 'date', 'weight', 'score']]
# make a list of missing exams
missing_exams_list = list(missing_exams['id_assessment'])

In [158]:
# make a dataframe of all assessments with NaN scores
NaN_scores = student_assessment.loc[student_assessment['score'].isnull() == True]

# make a dataframe of students whose score is NaN from student info
#initiate dataframe to store students with NaN scores
students_w_NaN_scores = pd.DataFrame()

# iterate through NaN_scores
for index, row in NaN_scores.iterrows():
    # if student_id from NaN scores is found in student_info_reg append that students information to a new dataframe students_w_NaN_scores
    students_w_NaN_scores = students_w_NaN_scores.append(stud_info.loc[stud_info['id_student'] == row['id_student']])

In [159]:
# display the new dataframe
students_w_NaN_scores.head()

Unnamed: 0,module,presentation,id_student,region,imd,age,gender,education,disability,attempts,credits,result,date_registration,date_unregistration
227,AAA,2013J,721259,South Region,50-60%,55<=,F,Lower Than A Level,False,0,120,Withdrawn,-73,23.0
638,AAA,2014J,721259,South Region,50-60%,55<=,F,Lower Than A Level,False,1,60,Withdrawn,-30,128.0
108,AAA,2013J,260355,London Region,80-90%,35-55,F,A Level or Equivalent,False,0,60,Withdrawn,-186,170.0
466,AAA,2014J,260355,London Region,80-90%,35-55,F,A Level or Equivalent,False,1,120,Withdrawn,-156,-87.0
733,AAA,2014J,2606802,North Region,60-70%,0-35,M,A Level or Equivalent,False,0,60,Fail,-37,


This dataframe contains the students which are missing scores for their exams

In [None]:
# get the counts of each student result within the NaN scores dataframe
dataframe(students_w_NaN_scores['result'].value_counts())

For students which withdrew or failed it makes sense that some of their test scores would be missing. For the passed students, it is possible that they still made it by without passing an exam. The student with distinction is of note. Let's check their record first.

In [None]:
# locate the student in students_w_NaN scores whose final_result was Distinction
students_w_NaN_scores.loc[students_w_NaN_scores['result'] == 'Distinction']

Above we have the student in question with ID 571765. Now let's see the rest of their test scores

In [None]:
# locate the other test scores of the student with Distinction
student_assessment.loc[student_assessment['id_student'] == '571765'].fillna(0)

According to the data source, a score of 40 or less is interpreted as failure. This student receieved excellent marks on their exams aside from the NaN value which we have filled with a 0 here. It is very possible that this student still received distinction with a 0 on one exam. Also of note is that the exam was submitted late into the module, and possibly defaulted to a 0. Let's do another test case with the first student in the dataframe of students with NaN scores who still passed.

In [None]:
# locate the students who passed with NaN test scores
students_w_NaN_scores.loc[students_w_NaN_scores['result'] == 'Pass'].head()

Here is the dataframe of students with a NaN value test score who still passed.
We will check the scores of the first student, 502717.

In [None]:
# locate the test scores of the first passing student with NaN scores
student_assessment.loc[student_assessment['id_student'] == '502717'].fillna(0)

Here we can see the test scores of the student with 502717. Once again we see that it was possible for them to have passed with a 0, and the exam was submitted late into the module.

With this information we will fill the NA values with 0's under the assumption that these exams were not turned in.

In [None]:
# putting 0 for the NA scores in student_assessment
student_assessment = student_assessment.fillna(0)

In [72]:
# make a dataframe of all assessments with NaN scores
NaN_scores = student_assessment.loc[student_assessment['score'].isnull() == True]
NaN_scores.head()

Unnamed: 0,id_assessment,id_student,date_submitted,is_banked,score
215,1752,721259,22,0,
937,1754,260355,127,0,
2364,1760,2606802,180,0,
3358,14984,186780,77,0,
3914,14984,531205,26,0,


Here is a dataframe of the assessments which are missing scores

In [73]:
# make a dataframe of students whose score is NaN from student info
#initiate dataframe to store students with NaN scores
students_w_NaN_scores = pd.DataFrame()

# iterate through NaN_scores
for index, row in NaN_scores.iterrows():
    # if student_id from NaN scores is found in student_info_reg append that students information to a new dataframe students_w_NaN_scores
    students_w_NaN_scores = students_w_NaN_scores.append(stud_info.loc[stud_info['id_student'] == row['id_student']])

In [74]:
# display the new dataframe
students_w_NaN_scores.head()

Unnamed: 0,module,presentation,id_student,region,imd,age,gender,education,disability,attempts,credits,result,date_registration,date_unregistration
227,AAA,2013J,721259,South Region,50-60%,55<=,F,Lower Than A Level,False,0,120,Withdrawn,-73,23.0
638,AAA,2014J,721259,South Region,50-60%,55<=,F,Lower Than A Level,False,1,60,Withdrawn,-30,128.0
108,AAA,2013J,260355,London Region,80-90%,35-55,F,A Level or Equivalent,False,0,60,Withdrawn,-186,170.0
466,AAA,2014J,260355,London Region,80-90%,35-55,F,A Level or Equivalent,False,1,120,Withdrawn,-156,-87.0
733,AAA,2014J,2606802,North Region,60-70%,0-35,M,A Level or Equivalent,False,0,60,Fail,-37,


This dataframe contains the students which are missing scores for their exams

In [None]:
# get the counts of each student result within the NaN scores dataframe
dataframe(students_w_NaN_scores['result'].value_counts())

For students which withdrew or failed it makes sense that some of their test scores would be missing. For the passed students, it is possible that they still made it by without passing an exam. The student with distinction is of note. Let's check their record first.

In [None]:
# locate the student in students_w_NaN scores whose final_result was Distinction
students_w_NaN_scores.loc[students_w_NaN_scores['result'] == 'Distinction']

Above we have the student in question with ID 571765. Now let's see the rest of their test scores

In [None]:
# locate the other test scores of the student with Distinction
student_assessment.loc[student_assessment['id_student'] == '571765'].fillna(0)

According to the data source, a score of 40 or less is interpreted as failure. This student receieved excellent marks on their exams aside from the NaN value which we have filled with a 0 here. It is very possible that this student still received distinction with a 0 on one exam. Also of note is that the exam was submitted late into the module, and possibly defaulted to a 0. Let's do another test case with the first student in the dataframe of students with NaN scores who still passed.

In [None]:
# locate the students who passed with NaN test scores
students_w_NaN_scores.loc[students_w_NaN_scores['result'] == 'Pass'].head()

Here is the dataframe of students with a NaN value test score who still passed.
We will check the scores of the first student, 502717.

In [None]:
# locate the test scores of the first passing student with NaN scores
student_assessment.loc[student_assessment['id_student'] == '502717'].fillna(0)

Here we can see the test scores of the student with 502717. Once again we see that it was possible for them to have passed with a 0, and the exam was submitted late into the module.

With this information we will fill the NA values with 0's under the assumption that these exams were not turned in.

In [None]:
# putting 0 for the NA scores in student_assessment
student_assessment = student_assessment.fillna(0)