# Purpose
The purpose of this notebook is to document the investigation of the Rendimiento (primary and secondary) Schools dataset.

## Import csv to pandas dataframe

In [16]:
import pandas as pd

df = pd.read_csv('../raw/schools/rendimiento/20160212_Rendimiento_2015_20160131_PUBL.csv', sep=';', decimal=',')
cols = ['MRUN', 'NOM_RBD', 'NOM_COM_RBD', 'RURAL_RBD', 'COD_GRADO', 'LET_CUR', 'GEN_ALU', 'NOM_COM_ALU', 'PROM_GRAL', 'ASISTENCIA', 'SIT_FIN_R']
df = df[cols]

## Description of desired columns

| Column ID        | Description              |
|------------------|--------------------------|
| MRUN             | Unique ID of student     |
| NOM_RBD          | Name of School           |
| NOM_COM_RBD      | Name of Comuna of school |
| RURAL_RBD        | Rural or urban           |
| COD_GRADO        | Grade code               |
| LET_CUR          | Course Letter            |
| GEN_ALU          | Gender of student        |
| NOM_COM_ALU      | Comuna of residence      |
| PROM_GRAL        | GPA                      |
| ASISTENCIA       | Attendance               |
| SIT_FIN_R        | Promotion: P: Promoted, R: Failed, Y: Retired, T: Transferred  |
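
The import cell loads the whole file and then selects the desired columns. As a minimal sketch of an alternative, `read_csv` can be asked to parse only the wanted columns through its `usecols` parameter. The sample data, `EXTRA_COL`, and the names `sample_csv`, `wanted`, and `sample_df` are made up for illustration and are not part of the real file.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same layout as the raw file:
# semicolon-separated, with a decimal comma in PROM_GRAL.
sample_csv = io.StringIO(
    "MRUN;NOM_RBD;PROM_GRAL;ASISTENCIA;SIT_FIN_R;EXTRA_COL\n"
    "1;ESCUELA A;6,4;85;P;x\n"
    "2;ESCUELA B;0;0;T;y\n"
    "3;ESCUELA C;5,9;97;P;z\n"
)
wanted = ['MRUN', 'NOM_RBD', 'PROM_GRAL', 'ASISTENCIA', 'SIT_FIN_R']

# usecols tells the parser to keep only the listed columns.
sample_df = pd.read_csv(sample_csv, sep=';', decimal=',', usecols=wanted)
```

On a large file this may reduce memory use, since the unwanted columns are never materialized in the dataframe.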


## Is the MRUN student ID unique?
To determine whether the MRUN column is unique, we'll use a simple built-in pandas function, [value_counts()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html). Given a column, it returns how many times each distinct value occurs in that column. Any MRUN with a count > 1 is therefore a duplicate.

In [17]:
student_freq = df['MRUN'].value_counts()
dupes = student_freq[student_freq > 1]
print("Number of duplicate student IDs: {}".format(len(dupes)))
print("Percentage of duplicates: {:.0%}".format(len(dupes) / float(len(student_freq))))

Number of duplicate student IDs: 183735
Percentage of duplicates: 6%
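
To make the counting logic concrete, here is a small sketch on made-up IDs (the `toy` frame and its values are hypothetical). It shows the `value_counts()` filter next to `duplicated(keep=False)`, which flags the duplicate rows themselves rather than the distinct IDs.

```python
import pandas as pd

# Hypothetical IDs: 101 appears three times, 102 twice, 103 once.
toy = pd.DataFrame({'MRUN': [101, 102, 101, 103, 102, 101]})

freq = toy['MRUN'].value_counts()   # occurrences of each distinct ID
toy_dupes = freq[freq > 1]          # IDs that occur more than once

# duplicated(keep=False) marks every row whose ID occurs more than once.
dupe_rows = toy[toy['MRUN'].duplicated(keep=False)]

# Share of distinct IDs that are duplicated (2 of 3 here).
share = len(toy_dupes) / float(len(freq))
```

Note that the percentage is computed over distinct IDs, not over rows: two of the three IDs are duplicated, while five of the six rows belong to a duplicated ID.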


## Investigate duplicate students
By looking up individual rows with duplicate MRUNs, we may be able to spot a pattern.

In [18]:
print(dupes.head())
# student with ID 20924355 has 21 entries!
# let's look up that individual record
print(df[df['MRUN'] == 20924355])

20924355    21
15606561    21
15332159    19
5993806     15
6675325     14
Name: MRUN, dtype: int64
             MRUN                                   NOM_RBD  NOM_COM_RBD  \
204567   20924355         ESCUELA ARTURO VILLALON SIEULANNE       OVALLE   
207200   20924355     ESCUELA PEDRO ENRIQUE ALFONSO BARRIOS       OVALLE   
213789   20924355                         ESCUELA SAMO ALTO  RIO HURTADO   
218550   20924355                    ESCUELA BASICA AMERICA   COMBARBALA   
224389   20924355               ESCUELA JUAN CARRASCO RISCO      ILLAPEL   
225392   20924355  ESCUELA BASICA LAS PALMERAS DE LIMAHUIDA      ILLAPEL   
225474   20924355      ESCUELA BASICA GRACIELA DIAZ ALLENDE      ILLAPEL   
227352   20924355              ESCUELA PARTICULAR LAS CANAS      ILLAPEL   
228672   20924355     ESCUELA BASICA BERTA HIDALGO BARAHONA    SALAMANCA   
228769   20924355                          ESCUELA EL TAMBO    SALAMANCA   
229001   20924355     ESCUELA BASICA GUISELA GAMBOA SALINAS    S

## Pattern
The pattern (at least for this record) is that even though this student ID appears 21 times, they only have 1 entry with an attendance rate > 0. Also all of the entries in the promotion column are 'T' which stands for 'Transferred' except 1 which is 'P' or 'Promoted'.

Let's look up a few more to see if this pattern continues...

In [19]:
print(dupes.tail())
print(df[df['MRUN'] == 6675325])
print(df[df['MRUN'] == 11689547])

15490241    2
13306994    2
4189721     2
1198205     2
11689547    2
Name: MRUN, dtype: int64
            MRUN                                      NOM_RBD    NOM_COM_RBD  \
282995   6675325                        ESCUELA BASICA ARAUCO       QUILLOTA   
512854   6675325       ESC. CONTRAMAESTRE CONSTANTINO MICALVI     LAS CABRAS   
530262   6675325                   COLEGIO BASICO CONSOLIDADO       NANCAGUA   
541555   6675325                ESCUELA MUNICIPAL DE PALMILLA       PALMILLA   
543510   6675325             COLEGIO MANUEL RODRIGUEZ ERDOIZA      PERALILLO   
547003   6675325             ESCUELA PROF. MONICA SILVA GOMEZ    LA ESTRELLA   
1851442  6675325                   ESCUELA LUIS MATTE LARRAIN    PUENTE ALTO   
1943794  6675325                    LICEO REPUBLICA DE ITALIA  ISLA DE MAIPO   
1953595  6675325           ESCUELA MAND EDUARDO FREI MONTALVA       PEÑAFLOR   
1983305  6675325       LICEO MUNICIPAL SARA TRONCOSO TRONCOSO          ALHUE   
2375463  6675325         

Here we see that ID '6675325' matches the pattern above, with only 1 record containing an attendance value > 0, but the second ID, '11689547', does not. This second ID has no entries with an attendance value > 0. Looking at the 'Promotion' column, we notice 1 entry for 'Transferred' and 1 for 'Retired', both with attendance values of 0.

This is a judgement call, but I think we should remove any duplicate entries that have no attendance > 0.
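
Rather than inspecting IDs one at a time, the pattern can be checked for all duplicated IDs at once. The following is a minimal sketch on made-up rows (the `toy` frame is hypothetical): it counts the rows with attendance > 0 per duplicated ID, then drops the duplicate rows with zero attendance.

```python
import pandas as pd

# Hypothetical rows: ID 1 fits the pattern (one attended row among
# several), ID 2 has no attended rows at all, ID 3 is unique.
toy = pd.DataFrame({
    'MRUN':       [1, 1, 1, 2, 2, 3],
    'ASISTENCIA': [0, 93, 0, 0, 0, 88],
    'SIT_FIN_R':  ['T', 'P', 'T', 'T', 'Y', 'P'],
})

is_dupe = toy['MRUN'].duplicated(keep=False)
dupe_rows = toy[is_dupe]

# Number of rows with attendance > 0 for each duplicated ID.
attended_per_id = dupe_rows['ASISTENCIA'].gt(0).groupby(dupe_rows['MRUN']).sum()

# Drop duplicate-ID rows with zero attendance; unique IDs are untouched.
cleaned = toy[~(is_dupe & (toy['ASISTENCIA'] == 0))]
```

Under this rule an ID such as 2, which has no attended rows, disappears from the data entirely, which is the consequence of the judgement call described above.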

## Next question - unique entries with no attendance
This leads to our next question: are there any unique student entries (MRUNs that appear only once) with 0 attendance that are not categorized as transfers?

In [20]:
uniq_student_ids = student_freq[student_freq == 1]
uniq_student_rows = df[df["MRUN"].isin(uniq_student_ids.index)]
# keep only the unique-student rows with zero attendance that are not transfers
no_attd_no_t = uniq_student_rows[(uniq_student_rows["ASISTENCIA"] == 0) & (uniq_student_rows["SIT_FIN_R"]!='T')]
promotion_vc = no_attd_no_t.SIT_FIN_R.value_counts()
print(promotion_vc.head())

Y    89570
        99
Name: SIT_FIN_R, dtype: int64


So we have 89,570 entries of 'Retired' students and 99 blanks. These can all be dropped as well.
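
Dropping such rows can be done by index label. A minimal sketch on made-up data (the `toy` frame and its values are hypothetical), reusing the same filter as the cell above:

```python
import pandas as pd

# Hypothetical unique students: one retired, one promoted,
# one with a blank status, one transferred.
toy = pd.DataFrame({
    'MRUN':       [10, 11, 12, 13],
    'ASISTENCIA': [0, 95, 0, 0],
    'SIT_FIN_R':  ['Y', 'P', ' ', 'T'],
})

# Rows with zero attendance that are not transfers (retired or blank here).
to_drop = toy[(toy['ASISTENCIA'] == 0) & (toy['SIT_FIN_R'] != 'T')]

# drop() removes rows by index label and returns a new dataframe.
remaining = toy.drop(to_drop.index)
```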

## Further question - duplicates WITH attendance
Looking only at the duplicate entries in the dataset, find those with attendance > 0 (if there are any).

In [21]:
dupes_df = df[df["MRUN"].isin(dupes.index)]
dupes_df_w_attendance = dupes_df[dupes_df["ASISTENCIA"] > 0]
dupes_w_attendance_freq = dupes_df_w_attendance['MRUN'].value_counts()
still_dupes = dupes_w_attendance_freq[dupes_w_attendance_freq > 1]
print("Still duplicates: {}".format(len(still_dupes)))

Still duplicates: 42


Let's look at some of these entries in the full duplicate dataset.

In [22]:
print(dupes_df_w_attendance[dupes_df_w_attendance['MRUN'] == 16641214])
print('\n------------------\n')
print(dupes_df_w_attendance[dupes_df_w_attendance['MRUN'] == 19884856])

             MRUN               NOM_RBD NOM_COM_RBD  RURAL_RBD  COD_GRADO  \
2130512  16641214  COLEGIO LITTLE STARS     IQUIQUE          0          3   
2161731  16641214  COLEGIO GOLDEN NORTH     IQUIQUE          0          3   

        LET_CUR  GEN_ALU NOM_COM_ALU  PROM_GRAL  ASISTENCIA SIT_FIN_R  
2130512       A        2   LA GRANJA        0.0         100         T  
2161731       A        2   LA GRANJA        6.4          85         P  

------------------

             MRUN                    NOM_RBD   NOM_COM_RBD  RURAL_RBD  \
1386122  19884856    COLEGIO NIDO DE AGUILAS  LO BARNECHEA          0   
2866545  19884856  COLEGIO MAIMONIDES SCHOOL  LO BARNECHEA          0   

         COD_GRADO LET_CUR  GEN_ALU   NOM_COM_ALU  PROM_GRAL  ASISTENCIA  \
1386122          4       A        1  LO BARNECHEA        6.9          92   
2866545          5       A        1  LO BARNECHEA        6.6          97   

        SIT_FIN_R  
1386122         P  
2866545         P  


## Develop algorithm to deal with duplicates with attendance
Our approach to this edge case is to keep the entry with the highest attendance rate, meaning the school where the student was absent the least. Another option would be to average these rows, but that would leave us with some columns that don't aggregate well, like the rural/urban column or any other binary column.

In [23]:
# of these remaining duplicates, first drop any transfers
dupes_df_w_attendance = dupes_df_w_attendance[dupes_df_w_attendance["SIT_FIN_R"] != "T"]

# for each ID that was still duplicated, mark its lowest-attendance row for
# dropping, so that with two rows the higher-attendance entry is the one kept
# caveat: if the transfer filter left only one row for an ID, idxmin() marks
# that sole row (e.g. label 2161731, first in the output below)
dupes_to_drop = []
for mrun in still_dupes.index:
    mrun_df = dupes_df_w_attendance[dupes_df_w_attendance['MRUN'] == mrun]
    dupes_to_drop.append(mrun_df["ASISTENCIA"].idxmin())

print(dupes_to_drop)

[2161731, 1386122, 2581545, 556492, 3233897, 2262729, 1229418, 2581520, 179175, 556550, 247422, 1058739, 3223977, 1066341, 3098129, 553510, 2266523, 2302332, 2149421, 2534545, 1058790, 247356, 2432252, 985645, 976671, 3222046, 3017803, 1999998, 1386324, 1784471, 277956, 2779472, 1724317, 39511, 39390, 247416, 2621160, 2403389, 2990307, 1395341, 2000135, 369972]
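
The loop above marks exactly one row per ID. As a minimal alternative sketch, assuming the goal is to keep the single highest-attendance non-transfer row per student, `groupby(...).idxmax()` selects the row to keep and every other row becomes droppable. The `toy` frame, its values, and its index labels are made up for illustration.

```python
import pandas as pd

# Hypothetical rows: ID 1 has a transfer plus one real entry,
# ID 2 has three real entries, ID 3 has a single entry.
toy = pd.DataFrame({
    'MRUN':       [1, 1, 2, 2, 2, 3],
    'ASISTENCIA': [100, 85, 92, 97, 60, 75],
    'SIT_FIN_R':  ['T', 'P', 'P', 'P', 'P', 'P'],
}, index=[10, 11, 12, 13, 14, 15])

non_transfers = toy[toy['SIT_FIN_R'] != 'T']

# idxmax() gives, per student, the index label of the highest-attendance row.
keep_idx = non_transfers.groupby('MRUN')['ASISTENCIA'].idxmax()
best_rows = non_transfers.loc[keep_idx]

# Every non-transfer row that is not the best one for its student.
to_drop = non_transfers.index.difference(keep_idx)
```

Selecting the row to keep, instead of the row to drop, means an ID left with a single row after the transfer filter keeps that row, and an ID with three or more rows ends up with exactly one.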
