In [1]:
import numpy as np
import pandas as pd
import config # a python file that contains path to TIMSS data files
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

In [2]:
df_student_g8 = pd.read_csv(config.cleaned_G8_student_data_path)

In [3]:
df_student_g8.shape

(8458, 107)

In [4]:
df_student_g8.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8458 entries, 0 to 8457
Columns: 107 entries, IDSTUD to mean_PV
dtypes: float64(10), object(97)
memory usage: 6.9+ MB


Our objective is to build an ML model that predicts a student's TIMSS score from the other information available about that student.

We have 106 candidate feature columns and 1 label, ***mean_PV***. It is not practical to build an application that asks people to enter 106 pieces of information about one student in order to estimate a single TIMSS score. We need to drastically reduce the number of features while keeping as much as possible of the information that helps with this prediction.

Based on the "all_cleaned_profiles.html" file, many groups of features are highly correlated with each other. We have to remove this multicollinearity.

The correlated groups are:
- 1. BSBGHER, BSBG04, BSBG05D
- 2. BSDGHER, 1. BSBGHER, BSBG04
- 3. BSBGSSB, BSBDSSB, BSBG13A, BSBG13B, BSBG13C, BSBG13E
- 4. BSDGSSB, BSBG13A, BSBG13B, BSBG13C, BSBG13D
- 5. BSBGSB, BSDGSB *Both are correlated only with each other* 
- 6. BSBGSLM, BSBM16A, BSBM16D, BSBM16E, BSBM16F, BSBM16G, BSBM16H, BSBM16I, 8. BSBGICM, 11. BSBGSCM, 13. BSBGSVM
- 7. BSDGSLM, 6. BSBGSLM, BSBM16A, BSBM16D, BSBM16E, BSBM16F, BSBM16G, BSBM16H, BSBM16I
- 8. BSBGICM, 6. BSBGSLM, BSBM17B, BSBM17C, BSBM17D, BSBM17E, BSBM17F, BSBM17G
- 9. BSDGICM, 8. BSBGICM, BSBM17B, BSBM17C, BSBM17D, BSBM17E, BSBM17F, BSBM17G
- 10. BSBGDML, BSDGDML *Both are correlated only with each other*
- 11. BSBGSCM, 6. BSBGSLM, 12. BSDGSCM
- 12. BSDGSCM, 11. BSBGSCM
- 13. BSBGSVM, 6. BSBGSLM, BSBM20A, BSBM20B, BSBM20C, BSBM20D, BSBM20E, BSBM20F, BSBM20G, BSBM20I, BSDGSVM
- 14. BSDGSVM, 13. BSBGSVM, BSBM20B, BSBM20C, BSBM20D, BSBM20F, BSBM20G, BSBM20I
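
The groups above were read off the profiling report. As a rough cross-check, a pairwise scan of a correlation matrix can list such pairs programmatically. The sketch below is illustrative only: it assumes the columns have already been encoded numerically (the real file is mostly `object` columns), it uses Pearson correlation, which may not be the measure used in the report, and the helper name `highly_correlated_pairs`, the 0.8 threshold and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, corr) for every column pair with |corr| >= threshold."""
    corr = df.corr()  # Pearson by default; an assumption, not the report's measure
    # Keep only the strict upper triangle so each pair is reported once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # NaN entries (masked cells) compare as False and are filtered out here.
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, float(c)) for (a, b), c in pairs.items()]


# Toy data, not TIMSS values: "scale" is built from "item_1", "noise" is unrelated.
rng = np.random.default_rng(0)
item_1 = rng.integers(1, 5, size=500)
toy = pd.DataFrame({
    "item_1": item_1,
    "scale": item_1 * 2.0 + rng.normal(0, 0.1, size=500),
    "noise": rng.integers(1, 5, size=500),
})

toy_pairs = highly_correlated_pairs(toy)
print(toy_pairs)
```

Only the `item_1`/`scale` pair should be reported here, since `noise` is generated independently of the other two columns.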

The following table lists the summary columns that TIMSS researchers derived from the student data, each covering a different aspect of the student. Each column was derived from a set of student answers, and together they represent the whole student file fairly well:

| Column | Meaning |
| :---: | :---: |
| BSBGHER | Home Educational Resources/SCL |
| BSDGHER | Home Educational Resources/IDX |
| BSBGSSB | Students Sense of School Belonging/SCL |
| BSDGSSB | Students Sense of School Belonging/IDX |
| BSBGSB | Student Bullying/SCL |
| BSDGSB | Student Bullying/IDX |
| BSBGSLM | Students Like Learning Mathematics Lessons/SCL |
| BSDGSLM | Students Like Learning Mathematics Lessons/IDX |
| BSBGICM | Instructional Clarity in Mathematics Lessons/SCL |
| BSDGICM | Instructional Clarity in Mathematics Lessons/IDX |
| BSBGDML | Disorderly Behavior during Math Lessons/SCL |
| BSDGDML | Disorderly Behavior during Math Lessons/IDX |
| BSBGSCM | Student Confident in Mathematics/SCL |
| BSDGSCM | Student Confident in Mathematics/IDX |
| BSBGSVM | Students Value Mathematics/SCL |
| BSDGSVM | Students Value Mathematics/IDX |


As seen above, every column in the derived summary is highly correlated with one or more columns from the actual questionnaire answered by students, except for two pairs of columns that are correlated only with each other.

Since the derived values can only be obtained when we have all of a student's questionnaire answers, we can replace each derived column, which summarizes those answers, with one of the actual questionnaire columns it is highly correlated with. For the two exceptional pairs, we can look up the questionnaire columns they summarize and choose one or more of them for now.
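
The substitution described above can be sketched as picking, for each derived column, the questionnaire item most strongly associated with it. This is a minimal sketch under stated assumptions, not the procedure actually used here (the columns in this notebook were chosen by reading the profiling report): the answers are assumed to be numerically encoded, Pearson correlation is assumed as the association measure, and `best_proxy` and the toy column names are hypothetical.

```python
import numpy as np
import pandas as pd


def best_proxy(df, derived_col, candidate_cols):
    """Return (column, |corr|) for the candidate most correlated with derived_col."""
    strengths = df[candidate_cols].corrwith(df[derived_col]).abs()
    return strengths.idxmax(), float(strengths.max())


# Toy data, not TIMSS values: the derived "scale" leans mostly on "item_a".
rng = np.random.default_rng(1)
items = pd.DataFrame(
    rng.integers(1, 5, size=(500, 3)),
    columns=["item_a", "item_b", "item_c"],
)
items["scale"] = 3 * items["item_a"] + items["item_b"]

proxy, strength = best_proxy(items, "scale", ["item_a", "item_b", "item_c"])
print(proxy, round(strength, 2))
```

In the toy data `item_a` carries most of the weight in `scale`, so it is the item selected as the proxy.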

The chosen columns and their meaning are as follows:

| Column | Meaning |
| :---: | :---: |
| BSBG05D | HOME POSSESS\INTERNET CONNECTION |
| BSBG04 | AMOUNT OF BOOKS IN YOUR HOME |
| BSBG13E | AGREE\PROUD TO GO TO THIS SCHOOL |
| BSBG13D | AGREE\FAIR TEACHERS |
| BSBG14A | GEN\HOW OFTEN\SAID MEAN THINGS |
| BSBM16A | AGREE\ENJOY LEARNING MATHEMATICS |
| BSDGSLM | NO NEED: correlated with BSBM16A |
| BSBM17D | MATH\AGREE\TEACHER EXPLAINS GOOD |
| BSDGICM | NO NEED: correlated with BSBM17D |
| BSBM18C | MAT\HOW OFTEN\TOO DISORDERLY TO WORK |
| BSBGSCM | NO NEED: correlated with BSBM16A |
| BSBM20E | MATH\AGREE\JOB INVOLVING MATHEMATICS |
| BSBM20I | MATH\AGREE\IMPORTANT TO DO WELL IN MATH |


Let's explore them and their relation with the label (mean_PV) column:

In [5]:
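# .copy() makes an independent DataFrame, so later in-place changes do not trigger SettingWithCopyWarning
df_summary_g8 = df_student_g8[['BSBG05D','BSBG04','BSBG13E','BSBG13D','BSBG14A','BSBM16A','BSBM17D','BSBM18C','BSBM20E','BSBM20I','mean_PV']].copy()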

In [6]:
df_summary_g8.shape

(8458, 11)
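
Besides the profiling report generated below, a quick way to see how each chosen feature relates to the label is to average `mean_PV` per answer category. The following is a hedged sketch: `mean_label_by_category` is a hypothetical helper, and the toy rows and answer labels are made up for illustration rather than taken from the TIMSS file.

```python
import pandas as pd


def mean_label_by_category(df, feature_cols, label_col="mean_PV"):
    """Map each feature to a table of label count and mean per answer category."""
    return {
        col: (
            df.groupby(col)[label_col]
            .agg(["count", "mean"])
            .sort_values("mean", ascending=False)
        )
        for col in feature_cols
    }


# Toy rows, not TIMSS data; the answer labels are invented for this example.
toy = pd.DataFrame({
    "BSBG04": ["0-10", "0-10", "11-25", "11-25", "26-100"],
    "mean_PV": [350.0, 370.0, 400.0, 420.0, 480.0],
})

tables = mean_label_by_category(toy, ["BSBG04"])
print(tables["BSBG04"])
```

On the real data, the same call could be made with `df_summary_g8` and its ten feature columns; categories whose mean scores differ widely suggest a feature that carries useful signal.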

In [7]:
from ydata_profiling import ProfileReport

In [9]:
ml_profiles_1 = ProfileReport(df_summary_g8)

In [10]:
ml_profiles_1.to_file("ml_profiles_1.html")




In [None]:
df_summary_g8.to_csv('data/TIMSS-2019_data/TIMSS-2019_Morocco_8th/ml_data_1.csv', index=False)