# Students Data Wrangling
## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#gather">Gathering data</a></li>
<li><a href="#assess">Assessing data</a></li>
<li><a href="#clean">Cleaning data</a></li>
<li><a href="#store">Storing data</a></li>
<li><a href="#analyze">Analyze & Visualize</a></li>
</ul>

<a id='intro'></a>
## ***Introduction***

Having validated our understanding of how the country score is calculated, we will now take the data we have, assess it, clean it, and store it in a new file so that we can start analyzing it.

Let's start by importing all needed libraries:

In [1]:
import numpy as np
import pandas as pd
import config  # a Python file that contains the paths to the TIMSS data files
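
The `config` module itself is not shown in this notebook. A minimal sketch of what such a file might contain is given below. The directory and file name are hypothetical placeholders; only the attribute name `student_data_path_G8` comes from this notebook.

```python
# config.py (hypothetical sketch, not the author's actual file)
from pathlib import Path

# Assumed location of the TIMSS files; replace with the real directory.
TIMSS_DATA_DIR = Path("data") / "timss"

# Only this attribute name is used in the notebook.
# The file name below is a placeholder, not the real TIMSS file name.
student_data_path_G8 = TIMSS_DATA_DIR / "student_context_grade8.sav"
```

Keeping paths in a separate module like this means the notebook can be rerun on another machine by editing a single file.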

<a id='gather'></a>
## ***Gathering Data***

We believe that teacher data could also affect students' final scores, which is why we will explore the teachers' context data at a later phase.

At this stage, we intend to work only with the data of Moroccan $8^{th}$-grade students:


In [2]:
# Students Data Context
df_all_Students_data = pd.read_spss(config.student_data_path_G8)

### The columns we might be interested in within the student context data are:

<table>
    <thead>
        <tr>
            <th>Student Context</th>
            <th>Details</th>
            <th>Columns (intervals)</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td rowspan=6>Home</td>
            <td>Student's personal information</td>
            <td>]4:7]</td>
        </tr>
        <tr>
            <td>Student's technology belongings</td>
            <td>]7:17]</td>
        </tr>
        <tr>
            <td>Parent's education</td>
            <td>]17:19]</td>
        </tr>
        <tr>
            <td>How far do students think they could go in education</td>
            <td>]19:20]</td>
        </tr>
        <tr>
            <td>How often the student is absent from school</td>
            <td>]24:25]</td>
        </tr>
        <tr>
            <td>Student's Internet usage</td>
            <td>]27:33]</td>
        </tr>
        <tr>
            <td rowspan=2>School</td>
            <td>Student's school perception</td>
            <td>]33:38]</td>
        </tr>
        <tr>
            <td>Student's feelings</td>
            <td>]38:52]</td>
        </tr>
        <tr>
            <td colspan=2>Teachers from students' eyes</td>
            <td>]62:69]</td>
        </tr>
        <tr>
            <td colspan=2>Classes from students' eyes</td>
            <td>]69:75]</td>
        </tr>
        <tr>
            <td colspan=2>Math Likability</td>
            <td>]52:62]</td>
        </tr>
        <tr>
            <td colspan=2>Math perception</td>
            <td>]75:84]</td>
        </tr>
        <tr>
            <td colspan=2>Math Importance</td>
            <td>]84:93]</td>
        </tr>
        <tr>
            <td colspan=2>Math Homework & Extra Lessons</td>
            <td>[127,129,131,133,248,253,258,260]</td>
        </tr>
        <tr>
            <td colspan=2>Technology usage</td>
            <td>]269:288]</td>
        </tr>
        <tr>
            <td colspan=2>Plausible Math Values</td>
            <td>]309:314]</td>
        </tr>
        <tr>
            <td colspan=2>Math data Summary</td>
            <td>]404:420]</td>
        </tr>
    </tbody>
</table>
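
The intervals in the table can be turned into positional indices for `iloc`. The sketch below assumes that `]a:b]` denotes a left-open, right-closed range of zero-based column positions (so `]4:7]` would mean columns 5, 6 and 7). This is an interpretation of the notation, not something the table states. The helper `interval_to_positions` and the toy frame are hypothetical and serve only as an illustration.

```python
import numpy as np
import pandas as pd


def interval_to_positions(start, stop):
    """Convert an interval ]start:stop] into positions start+1 .. stop (inclusive)."""
    return np.arange(start + 1, stop + 1)


# Toy frame with ten columns named c0..c9, standing in for the real data.
toy = pd.DataFrame(np.zeros((2, 10)), columns=[f"c{i}" for i in range(10)])

# Combine one interval, ]4:7], with an explicit list of single positions, [9].
positions = np.r_[interval_to_positions(4, 7), [9]]
subset = toy.iloc[:, positions]

print(list(subset.columns))
```

On the real data, the same pattern could be applied to `df_all_Students_data`, concatenating one `interval_to_positions` call per table row inside `np.r_`.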

<a id='assess'></a>
## ***Assessing Data***

Before assessing ***quality*** and ***tidiness*** issues, it is better to reduce the number of columns, since we already know which ones we want to keep:


In [6]:
# Select the math homework and extra-lessons columns by position
test_df = df_all_Students_data.iloc[:, np.r_[127,129,131,133,248,253,258,260]]

In [7]:
test_df.head()

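<table>
    <thead>
        <tr><th></th><th>BSBM26AA</th><th>BSBM26BA</th><th>BSBM27AA</th><th>BSBM27BA</th><th>BSBM42AA</th><th>BSBM42BA</th><th>BSBM43AA</th><th>BSBM43BA</th></tr>
    </thead>
    <tbody>
        <tr><th>0</th><td>3 or 4 times a week</td><td>1–15 minutes</td><td>NaN</td><td>NaN</td><td>3 or 4 times a week</td><td>1–15 minutes</td><td>NaN</td><td>NaN</td></tr>
        <tr><th>1</th><td>Every day</td><td>16–30 minutes</td><td>Yes, to keep up in class</td><td>Less than 4 months</td><td>Every day</td><td>16–30 minutes</td><td>Yes, to keep up in class</td><td>Less than 4 months</td></tr>
        <tr><th>2</th><td>3 or 4 times a week</td><td>16–30 minutes</td><td>NaN</td><td>NaN</td><td>3 or 4 times a week</td><td>16–30 minutes</td><td>NaN</td><td>NaN</td></tr>
        <tr><th>3</th><td>3 or 4 times a week</td><td>1–15 minutes</td><td>No</td><td>Did not attend</td><td>3 or 4 times a week</td><td>1–15 minutes</td><td>No</td><td>Did not attend</td></tr>
        <tr><th>4</th><td>3 or 4 times a week</td><td>16–30 minutes</td><td>NaN</td><td>NaN</td><td>3 or 4 times a week</td><td>16–30 minutes</td><td>NaN</td><td>NaN</td></tr>
    </tbody>
</table>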
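
A first assessment pass on a subset like this could count the missing values and list the distinct answers in each column. The sketch below is only an illustration: it rebuilds by hand the five rows displayed above for four of the columns instead of reading the TIMSS file, and the name `sample` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the five displayed rows (first four columns only).
sample = pd.DataFrame({
    "BSBM26AA": ["3 or 4 times a week", "Every day", "3 or 4 times a week",
                 "3 or 4 times a week", "3 or 4 times a week"],
    "BSBM26BA": ["1–15 minutes", "16–30 minutes", "16–30 minutes",
                 "1–15 minutes", "16–30 minutes"],
    "BSBM27AA": [np.nan, "Yes, to keep up in class", np.nan, "No", np.nan],
    "BSBM27BA": [np.nan, "Less than 4 months", np.nan, "Did not attend", np.nan],
})

# Quality check: how many answers are missing in each column?
missing_per_column = sample.isna().sum()

# Distinct non-missing answers per column, in order of first appearance.
distinct_answers = {
    col: sample[col].dropna().unique().tolist() for col in sample.columns
}

print(missing_per_column)
print(distinct_answers["BSBM27AA"])
```

The same two checks could be run on `test_df` directly; on the full data the counts will of course differ from this five-row sample.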
