# Students Data Wrangling
## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#gather">Gathering data</a></li>
<li><a href="#assess">Assessing data</a></li>
<li><a href="#clean">Cleaning data</a></li>
<li><a href="#store">Storing data</a></li>
<li><a href="#analyze">Analyze & Visualize</a></li>
</ul>

<a id='intro'></a>
## ***Introduction***

Now that we have validated our understanding of how the country score is calculated, we will take the data we have, assess it, clean it, and store it in a new file so that we can start analyzing it.

Let's start by importing all needed libraries:

In [2]:
import numpy as np
import pandas as pd
import config  # a Python file that contains the paths to the TIMSS data files
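
The `config` module itself is not shown in this notebook. A minimal sketch of what such a file could look like is given here. The directory is borrowed from the CSV export path used later in this notebook, while the file name `students_g8.sav` is a hypothetical placeholder and not the real TIMSS file name.

```python
# config.py (hypothetical sketch; the real module is not shown in this notebook)
from pathlib import Path

# Directory borrowed from the CSV export path used later in this notebook.
TIMSS_DIR = Path("data") / "TIMSS-2019_data" / "TIMSS-2019_Morocco_8th"

# Placeholder file name: replace it with the actual SPSS (.sav) student file.
student_data_path_G8 = TIMSS_DIR / "students_g8.sav"
```

Keeping the paths in one module means the notebook cells do not need to change when the data is moved.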

<a id='gather'></a>
## ***Gathering Data***

We believe that data from teachers could also affect the students' final scores, so we will explore the teachers' context data at a later phase.

At this stage, we intend to work only with the data of $8^{th}$-grade Moroccan students:


In [3]:
# Students Data Context
df_all_Students_data = pd.read_spss(config.student_data_path_G8)

### Columns of interest in the students' context data:

<table>
    <thead>
        <tr>
            <th>Student Context</th>
            <th>Details</th>
            <th>Column positions (intervals)</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td rowspan=6>Home</td>
            <td>Student's personal information</td>
            <td>]4:7]</td>
        </tr>
        <tr>
            <td>Student's technology belongings</td>
            <td>]7:17]</td>
        </tr>
        <tr>
            <td>Parent's education</td>
            <td>]17:19]</td>
        </tr>
        <tr>
            <td>How far do students think they could go in education</td>
            <td>]19:20]</td>
        </tr>
        <tr>
            <td>How often the student is absent from school</td>
            <td>]24:25]</td>
        </tr>
        <tr>
            <td>Student's Internet usage</td>
            <td>]27:33]</td>
        </tr>
        <tr>
            <td rowspan=2>School</td>
            <td>Student's school perception</td>
            <td>]33:38]</td>
        </tr>
        <tr>
            <td>Student's feelings</td>
            <td>]38:52]</td>
        </tr>
        <tr>
            <td colspan=2>Teachers from students' eyes</td>
            <td>]62:69]</td>
        </tr>
        <tr>
            <td colspan=2>Classes from students' eyes</td>
            <td>]69:75]</td>
        </tr>
        <tr>
            <td colspan=2>Math Likability</td>
            <td>]52:62]</td>
        </tr>
        <tr>
            <td colspan=2>Math perception</td>
            <td>]75:84]</td>
        </tr>
        <tr>
            <td colspan=2>Math Importance</td>
            <td>]84:93]</td>
        </tr>
        <tr>
            <td colspan=2>Math Homework & Extra Lessons</td>
            <td>[127,129,131,133,248,253,258,260]</td>
        </tr>
        <tr>
            <td colspan=2>Technology usage</td>
            <td>]269:288]</td>
        </tr>
        <tr>
            <td colspan=2>Plausible Math Values</td>
            <td>]309:314]</td>
        </tr>
        <tr>
            <td colspan=2>Math data Summary</td>
            <td>]404:420]</td>
        </tr>
    </tbody>
</table>
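
The intervals in the table line up with the positional slices used in the selection cells of this notebook: `]4:7]` appears there as part of `4:20`, so an interval `]a:b]` can be read as the `iloc` slice `a:b`. The sketch below illustrates that reading on a synthetic frame. The placeholder column names and the `home_intervals` dictionary are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the student file: 2 rows, 460 placeholder columns.
demo = pd.DataFrame(
    np.zeros((2, 460)),
    columns=[f"col_{i}" for i in range(460)],
)

# A few intervals copied from the table; ]a:b] is read as positions a..b-1.
home_intervals = {
    "personal_information": (4, 7),
    "technology_belongings": (7, 17),
    "parents_education": (17, 19),
}

# Turn every interval into explicit positions, then select them in one call.
positions = np.concatenate([np.arange(a, b) for a, b in home_intervals.values()])
home_demo = demo.iloc[:, positions]

print(home_demo.shape)  # (2, 15): 3 + 10 + 2 columns
```

Naming the intervals this way keeps the link between a group in the table and its positions explicit, at the cost of a few more lines than a single `np.r_[...]` expression.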

<a id='assess'></a>
## ***Assessing Data***

Before assessing ***quality*** and ***tidiness*** issues, it is better to reduce the number of columns in the file, since we already know which ones we want to keep:


In [4]:
# Average the five plausible math values (positions 309 to 313) into one score per student
df_all_Students_data['mean_PV'] = df_all_Students_data.iloc[:, np.r_[309:314]].mean(axis=1)
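
As a small illustration of what this cell computes, the sketch below takes the row-wise mean of five plausible values on made-up data. The column names `PV1` to `PV5` and the scores are hypothetical and do not come from the TIMSS file.

```python
import pandas as pd

# Made-up scores for two students; PV1..PV5 are hypothetical column names.
pv = pd.DataFrame({
    "PV1": [450.0, 500.0],
    "PV2": [460.0, 510.0],
    "PV3": [455.0, 505.0],
    "PV4": [445.0, 495.0],
    "PV5": [465.0, 515.0],
})

# axis=1 averages across the columns, giving one value per row (student).
pv["mean_PV"] = pv[["PV1", "PV2", "PV3", "PV4", "PV5"]].mean(axis=1)

print(pv["mean_PV"].tolist())  # [455.0, 505.0]
```

Note that `mean` skips missing values by default, so a student with one missing plausible value would still get an average of the remaining four.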

In [5]:
df_all_Students_data.shape

(8458, 461)

In [6]:
# Keep only the columns listed in the table above, plus the new mean_PV column (position 460)
df_math_data = df_all_Students_data.iloc[:, np.r_[4:20,24:25,27:93,127,129,131,133,248,253,258,260,269:288,309:314,404:420,460]]

In [7]:
df_math_data.shape

(8458, 132)
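
The column count of 132 can be checked directly from the index expression, without loading the data. The sketch below rebuilds the same `np.r_` selection and counts its positions; `idx` is a name introduced here for illustration.

```python
import numpy as np

# Same positional selection as in the cell that builds df_math_data.
idx = np.r_[4:20, 24:25, 27:93, 127, 129, 131, 133, 248, 253, 258, 260,
            269:288, 309:314, 404:420, 460]

# 16 + 1 + 66 + 8 + 19 + 5 + 16 + 1 positions in total.
print(len(idx))  # 132

# No position is selected twice, so no column is duplicated.
print(len(np.unique(idx)) == len(idx))  # True
```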

In [8]:
df_math_data.sample(1)

Unnamed: 0,IDSTUD,BSBG01,BSBG03,BSBG04,BSBG05A,BSBG05B,BSBG05C,BSBG05D,BSBG05E,BSBG05F,...,BSDGSLM,BSBGICM,BSDGICM,BSBGDML,BSDGDML,BSBGSCM,BSDGSCM,BSBGSVM,BSDGSVM,mean_PV
3221,50940404.0,Girl,Always,Enough to fill one shelf (11–25 books),Yes,Yes,No,No,Yes,Yes,...,Very Much Like Learning Mathematics,9.52219,Moderate Clarity of Instruction,9.19981,Some Lessons,10.59469,Somewhat Confident in Mathematics,11.3019,Strongly Value Mathematics,455.849746


In [10]:
df_math_data.to_csv('data/TIMSS-2019_data/TIMSS-2019_Morocco_8th/student_math_data.csv', index=False)
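
One thing to keep in mind with this export is that CSV does not store pandas dtypes, so the `category` columns produced by the SPSS import have to be restored after reloading. The sketch below shows the round trip on a tiny made-up frame, using an in-memory buffer instead of the real file; the column name `answer` is hypothetical.

```python
import io
import pandas as pd

# Small made-up frame with one categorical column, mimicking the SPSS import.
df = pd.DataFrame({"answer": pd.Categorical(["Yes", "No", "Yes"])})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# The categorical dtype is not stored in the CSV, so it has to be restored.
reloaded = pd.read_csv(buffer)
restored = reloaded.astype({"answer": "category"})

print(reloaded["answer"].dtype, restored["answer"].dtype)
```

If the dtypes matter downstream, a format that preserves them (for example pickle or Parquet) may be a better fit than CSV.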

In [9]:
# `from pandas_profiling import ProfileReport` is deprecated; the package was renamed.
# Installed with:
#   pip install -U ydata-profiling
#   pip install ipywidgets
from ydata_profiling import ProfileReport  # new package name


In [16]:
all_profiles = ProfileReport(df_math_data)

In [17]:
all_profiles.to_file("all_profiles.html")

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.rename(columns={"index": "df_index"}, inplace=True)
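
The message above is pandas' warning about modifying a frame that was derived from another one. It is raised from an in-place `rename` inside the profiler, and it likely appears because `df_math_data` was obtained by indexing `df_all_Students_data`. A common way to make such a subset independent is an explicit `.copy()`. The sketch below shows the pattern on synthetic data; whether it silences the warning here depends on the installed pandas and ydata-profiling versions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the full student frame.
full = pd.DataFrame(np.arange(12).reshape(3, 4), columns=["a", "b", "c", "d"])

# .copy() makes the subset an independent DataFrame rather than a derived slice.
subset = full.iloc[:, np.r_[0:2, 3]].copy()

# In-place operations on the copy leave the original frame untouched.
subset.rename(columns={"a": "a_renamed"}, inplace=True)

print(list(subset.columns))  # ['a_renamed', 'b', 'd']
print(list(full.columns))    # ['a', 'b', 'c', 'd']
```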



Let's explore the data related only to the students' home context:

In [10]:
df_home = df_all_Students_data.iloc[:, np.r_[4:20,24:25,27:33,460]]

In [11]:
df_home.info(show_counts = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8458 entries, 0 to 8457
Data columns (total 24 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   IDSTUD   8458 non-null   category
 1   BSBG01   8456 non-null   category
 2   BSBG03   7314 non-null   category
 3   BSBG04   8341 non-null   category
 4   BSBG05A  8373 non-null   category
 5   BSBG05B  8366 non-null   category
 6   BSBG05C  8375 non-null   category
 7   BSBG05D  8286 non-null   category
 8   BSBG05E  8361 non-null   category
 9   BSBG05F  8317 non-null   category
 10  BSBG05G  8314 non-null   category
 11  BSBG05H  8308 non-null   category
 12  BSBG05I  8318 non-null   category
 13  BSBG06A  8237 non-null   category
 14  BSBG06B  8256 non-null   category
 15  BSBG07   8292 non-null   category
 16  BSBG10   8277 non-null   category
 17  BSBG12A  8279 non-null   category
 18  BSBG12B  8309 non-null   category
 19  BSBG12C  8298 non-null   category
 20  BSBG12D  8236 non-null   categ
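
The non-null counts in this listing can be turned into missing-value percentages, which are easier to compare across columns. The sketch below does the arithmetic for a few counts copied from the output above, with the 8458 rows reported by `RangeIndex`; the names `non_null` and `missing_pct` are introduced here for illustration.

```python
import pandas as pd

n_rows = 8458  # number of entries reported by df_home.info()

# Non-null counts copied from the listing above for a few columns.
non_null = pd.Series({
    "BSBG01": 8456,
    "BSBG03": 7314,
    "BSBG04": 8341,
    "BSBG06A": 8237,
})

# Share of missing answers per column, in percent.
missing_pct = ((n_rows - non_null) / n_rows * 100).round(2)

print(missing_pct.sort_values(ascending=False))
```

On the real frame, `df_home.isna().mean() * 100` gives the same percentages for all columns at once.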

In [12]:
home_profile = ProfileReport(df_home)
home_profile




In [13]:
# saving to html format
home_profile.to_file("home_profile.html")
