### Cleaning Data

Each row should represent a single observation of a particular instance. This granularity makes the data easier to work with for analysis.

Each column should represent a single variable, whose values across rows we can use to derive insights and prepare reports from the data set.

In [59]:
import pandas as pd
import glob

To build one large DataFrame from several files that all hold the same kind of information, we can use glob.

In [60]:
# Collect the names of all files in the current directory that start with "exams"
student_files = glob.glob("exams*.csv")

# Read each file into its own DataFrame
df_list = []
for filename in student_files:
    data = pd.read_csv(filename)
    df_list.append(data)

# Stack the per-file DataFrames into one
students = pd.concat(df_list)

student_files

['exams6.csv',
 'exams7.csv',
 'exams5.csv',
 'exams4.csv',
 'exams0.csv',
 'exams1.csv',
 'exams3.csv',
 'exams2.csv',
 'exams9.csv',
 'exams8.csv']
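The listing above is not in alphabetical order, because glob does not guarantee any particular ordering. The following is a minimal sketch of the same pattern with a deterministic file order and a fresh row index. The temporary directory, the two tiny CSV files, and the names `tmp_dir`, `files`, and `combined` are hypothetical and exist only to keep the example self-contained.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical setup: write two tiny CSV files into a temporary directory
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"id": [0, 1], "full_name": ["A", "B"]}).to_csv(
    os.path.join(tmp_dir, "exams1.csv"), index=False
)
pd.DataFrame({"id": [2, 3], "full_name": ["C", "D"]}).to_csv(
    os.path.join(tmp_dir, "exams0.csv"), index=False
)

# sorted() makes the order of the matched file names deterministic
files = sorted(glob.glob(os.path.join(tmp_dir, "exams*.csv")))

# ignore_index=True replaces the repeated per-file row labels with 0..n-1
combined = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

print(combined)
```

Without `ignore_index=True`, each file contributes its own row labels starting at 0, so the labels repeat in the combined DataFrame.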

In [61]:
students.head()

||id|full_name|gender_age|fractions|probability|grade|
|---|---|---|---|---|---|---|
|0|0|Moses Kirckman|M14|69%|89%|11th grade|
|1|1|Timofei Strowan|M18|63%|76%|11th grade|
|2|2|Silvain Poll|M18|69%|77%|9th grade|
|3|3|Lezley Pinxton|M18||72%|11th grade|
|4|4|Bernadene Saunper|F17|72%|84%|11th grade|


In [62]:
len(students)

1000

Print out the column names:

In [63]:
students.columns

Index(['id', 'full_name', 'gender_age', 'fractions', 'probability', 'grade'], dtype='object')

### Reorganise the data
At the moment, the columns are not individual variables and the rows are not granular enough: each row describes a student rather than a single exam result.


There is a column for the scores on the fractions exam, and a column for the scores on the probabilities exam.

We want to make each row an observation, so we want to transform this table to look like:

|full_name|exam|
|-------:|:------|
|“First Student”|“Fractions”|
|“First Student”|“Probabilities”|
|“Second Student”|“Fractions”|
|“Second Student”|“Probabilities”|

Use pd.melt() to create a new table (here called students_melted) that follows this structure.

In [64]:
students_melted = pd.melt(frame=students,
                          id_vars=['full_name', 'gender_age', 'grade'],
                          value_vars=['fractions','probability'],
                          value_name="score", var_name="exam")
students_melted.head()

||full_name|gender_age|grade|exam|score|
|---|---|---|---|---|---|
|0|Moses Kirckman|M14|11th grade|fractions|69%|
|1|Timofei Strowan|M18|11th grade|fractions|63%|
|2|Silvain Poll|M18|9th grade|fractions|69%|
|3|Lezley Pinxton|M18|11th grade|fractions||
|4|Bernadene Saunper|F17|11th grade|fractions|72%|


Now each row holds a single exam result, so each student has two rows, one for each exam they took.
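To see the reshaping on a scale small enough to read in full, here is a minimal sketch that melts a two-student table. The DataFrame `wide`, its values, and the name `long_form` are made up for illustration and are not taken from the exam files.

```python
import pandas as pd

# Hypothetical wide table: one row per student, one column per exam
wide = pd.DataFrame({
    "full_name": ["First Student", "Second Student"],
    "fractions": ["69%", "63%"],
    "probability": ["89%", "76%"],
})

# Melt the two exam columns into an (exam, score) pair per row
long_form = pd.melt(
    frame=wide,
    id_vars=["full_name"],
    value_vars=["fractions", "probability"],
    var_name="exam",
    value_name="score",
)

# Sort by student so each student's two rows sit next to each other
print(long_form.sort_values("full_name"))
```

The melted table stacks all `fractions` rows first and all `probability` rows second, which is why `head()` on the full data shows only `fractions` rows.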

In [67]:

students_melted.exam.value_counts()


probability    1000
fractions      1000
Name: exam, dtype: int64

As seen above, there are 1000 entries for probability and 1000 entries for fractions. The number of rows has doubled, but each row is now more granular and easier to process.
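The row count can also be checked programmatically rather than by eye. The sketch below assumes a hypothetical helper, `melted_row_count_ok`, and a small made-up table with one missing score. It illustrates that melting keeps rows whose score is missing, so the count per exam still equals the number of original rows.

```python
import pandas as pd


def melted_row_count_ok(wide, long_form, value_vars):
    """Return True if long_form has one row per (wide row, melted column) pair."""
    return len(long_form) == len(wide) * len(value_vars)


# Hypothetical data: the second student has no fractions score
wide = pd.DataFrame({
    "full_name": ["First Student", "Second Student", "Third Student"],
    "fractions": ["69%", None, "72%"],
    "probability": ["89%", "72%", "84%"],
})
exam_cols = ["fractions", "probability"]

long_form = wide.melt(
    id_vars=["full_name"],
    value_vars=exam_cols,
    var_name="exam",
    value_name="score",
)

# value_counts() on the exam column counts rows, including those with a missing score
counts = long_form["exam"].value_counts()
print(counts)
print(melted_row_count_ok(wide, long_form, exam_cols))
```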