# CS2006 Advanced Programming Projects

## Python - Group Project 2

## Data analysis with pandas

In [1]:
import pandas as pd
import sys 
import os

sys.path.append("../code")

import consistency

We start by exploring the contents of the dataset.

In [2]:
df = pd.read_csv("../data/census2011.csv")
df

Unnamed: 0,Person ID,Region,Residence Type,Family Composition,Population Base,Sex,Age,Marital Status,Student,Country of Birth,Health,Ethnic Group,Religion,Economic Activity,Occupation,Industry,Hours worked per week,Approximated Social Grade
0,7394816,E12000001,H,2,1,2,6,2,2,1,2,1,2,5,8,2,-9,4
1,7394745,E12000001,H,5,1,1,4,1,2,1,1,1,2,1,8,6,4,3
2,7395066,E12000001,H,3,1,2,4,1,2,1,1,1,1,1,6,11,3,4
3,7395329,E12000001,H,3,1,2,2,1,2,1,2,1,2,1,7,7,3,2
4,7394712,E12000001,H,3,1,1,5,4,2,1,1,1,2,1,1,4,3,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
569736,7946020,W92000004,H,1,1,1,5,1,2,1,4,1,9,1,8,8,3,3
569737,7944310,W92000004,H,3,1,1,3,1,2,1,2,1,1,1,7,4,3,4
569738,7945374,W92000004,H,3,1,1,1,1,1,1,1,1,2,-9,-9,-9,-9,-9
569739,7944768,W92000004,H,1,1,2,8,5,2,1,3,1,9,5,9,2,-9,4


## Cleaning and verification of data ##
We check that the data frame matches the expected format and fix any issues.

In [3]:
consistency.cleanDataFrame(df)

Checking for problem values...
Value checking finished.
Checking types...
Discrepancy of type in column  Residence Type expected string found object
Type checking finished.
Retyping columns ['Residence Type'] ...
Retyping Residence Type from <class 'str'> to string
Retyping finished


Unnamed: 0,Person ID,Region,Residence Type,Family Composition,Population Base,Sex,Age,Marital Status,Student,Country of Birth,Health,Ethnic Group,Religion,Economic Activity,Occupation,Industry,Hours worked per week,Approximated Social Grade
0,7394816,E12000001,H,2,1,2,6,2,2,1,2,1,2,5,8,2,-9,4
1,7394745,E12000001,H,5,1,1,4,1,2,1,1,1,2,1,8,6,4,3
2,7395066,E12000001,H,3,1,2,4,1,2,1,1,1,1,1,6,11,3,4
3,7395329,E12000001,H,3,1,2,2,1,2,1,2,1,2,1,7,7,3,2
4,7394712,E12000001,H,3,1,1,5,4,2,1,1,1,2,1,1,4,3,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
569736,7946020,W92000004,H,1,1,1,5,1,2,1,4,1,9,1,8,8,3,3
569737,7944310,W92000004,H,3,1,1,3,1,2,1,2,1,1,1,7,4,3,4
569738,7945374,W92000004,H,3,1,1,1,1,1,1,1,1,2,-9,-9,-9,-9,-9
569739,7944768,W92000004,H,1,1,2,8,5,2,1,3,1,9,5,9,2,-9,4


In [4]:
df["Residence Type"]

0         H
1         H
2         H
3         H
4         H
         ..
569736    H
569737    H
569738    H
569739    H
569740    H
Name: Residence Type, Length: 569741, dtype: string
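
The log above suggests that `cleanDataFrame` compares each column's dtype with an expected one and retypes any mismatch. The code of `consistency` is not shown in this notebook, so the following is only a minimal sketch of that idea; `EXPECTED_DTYPES` and `retype_columns` are hypothetical names, not the module's actual API.

```python
import pandas as pd

# Hypothetical mapping of column name to the expected pandas dtype name.
EXPECTED_DTYPES = {"Residence Type": "string"}


def retype_columns(frame: pd.DataFrame, expected: dict) -> list:
    """Cast mismatching columns in place and return the names that changed."""
    changed = []
    for column, wanted in expected.items():
        if str(frame[column].dtype) != wanted:
            frame[column] = frame[column].astype(wanted)
            changed.append(column)
    return changed


# Small made-up sample; the text column starts out as plain Python objects.
sample = pd.DataFrame({"Residence Type": ["H", "C", "H"], "Age": [6, 4, 4]})
sample["Residence Type"] = sample["Residence Type"].astype(object)
changed = retype_columns(sample, EXPECTED_DTYPES)
```

Comparing dtype names as strings keeps the expected-type table easy to read and edit, at the cost of not distinguishing between dtypes that share a name.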

## Save cleaned data ##
Save the cleaned data to a separate file so we can reuse it later.

In [5]:
df.to_csv("../data/census2011-clean.csv")
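
One thing to keep in mind when reusing the file: CSV stores no dtype information, so a column retyped to `string` has to be retyped again on load, and `to_csv` writes the row index as an extra unnamed column unless told otherwise. The sketch below illustrates both points on a small made-up frame and an in-memory buffer; whether the project wants the index written is not stated here, so `index=False` is an assumption.

```python
import io

import pandas as pd

# Small made-up sample standing in for the cleaned census frame.
sample = pd.DataFrame({"Residence Type": ["H", "C"], "Age": [6, 4]})
sample["Residence Type"] = sample["Residence Type"].astype("string")

buffer = io.StringIO()
# index=False keeps the row index out of the file, so reloading adds no "Unnamed: 0" column.
sample.to_csv(buffer, index=False)
buffer.seek(0)

# The string dtype is requested again explicitly because the CSV does not record it.
reloaded = pd.read_csv(buffer, dtype={"Residence Type": "string"})
```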

# Design
Initially, we simply enumerated the possible values of each column and tested
every column against them.
This simple approach let us evaluate the quality of the data quickly.
Analysing the data first allowed us to make an informed decision about how to
handle invalid data.
Since this data set contained no invalid values, we assumed that future data sets
would be unlikely to contain many, and so we decided to remove any invalid rows
from the data set.
If there were a large number of invalid rows, this could cause problems: the
sample used for analysis might no longer be representative of the original data,
which could lead us to draw invalid conclusions.
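
The approach described above, enumerating the allowed values per column and dropping rows that contain anything else, could be sketched roughly as follows. `ALLOWED_VALUES` and `drop_invalid_rows` are hypothetical names, and the value sets are illustrative assumptions rather than the full census coding.

```python
import pandas as pd

# Illustrative allowed values only; the real sets come from the data set's documentation.
ALLOWED_VALUES = {
    "Residence Type": {"H", "C"},
    "Sex": {1, 2},
}


def drop_invalid_rows(frame: pd.DataFrame, allowed: dict) -> pd.DataFrame:
    """Return a copy of frame without rows holding a value outside the allowed set."""
    valid = pd.Series(True, index=frame.index)
    for column, values in allowed.items():
        valid &= frame[column].isin(values)
    return frame[valid].copy()


# Made-up sample: the second and third rows each contain one invalid value.
sample = pd.DataFrame({"Residence Type": ["H", "X", "C"], "Sex": [1, 2, 3]})
cleaned = drop_invalid_rows(sample, ALLOWED_VALUES)
```

Comparing `len(cleaned)` with `len(sample)` gives a quick measure of how many rows were dropped, which is the figure to watch for the representativeness concern raised above.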


We wanted cleaning and verification to be extensible to other data sets,
but our initial approach would have to be completely rewritten for a data set
with different columns. Therefore, we developed `OptionEnum`, which extends `Enum`
and stores a mapping from each key to its description. This lets us work with the
data set easily, listing all possible values with their descriptions as well as
parsing keys into descriptions.

In [6]:
import MicroDataTeachingVars as md
[f"{x.key()}: {x.desc()}" for x in md.EthnicityOptions]

['1: White',
 '2: Mixed',
 '3: Asian or Asian British',
 '4: Black or Black British',
 '5: Chinese or Other ethnic group',
 '-9: No code required (Not resident in england or wales, students or schoolchildren living away during term-time)']

We can also use this to search the dataset for a particular value.


In [7]:
df.loc[df["Age"] == md.AgeOptions.FROM_35_TO_44.key()]

Unnamed: 0,Person ID,Region,Residence Type,Family Composition,Population Base,Sex,Age,Marital Status,Student,Country of Birth,Health,Ethnic Group,Religion,Economic Activity,Occupation,Industry,Hours worked per week,Approximated Social Grade
1,7394745,E12000001,H,5,1,1,4,1,2,1,1,1,2,1,8,6,4,3
2,7395066,E12000001,H,3,1,2,4,1,2,1,1,1,1,1,6,11,3,4
6,7394871,E12000001,H,5,1,2,4,3,2,1,2,1,1,1,6,11,2,3
18,7395059,E12000001,H,1,1,1,4,1,2,1,3,1,1,1,8,2,3,4
22,7394857,E12000001,H,2,1,1,4,2,2,1,1,1,1,1,8,2,3,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
569685,7944687,W92000004,H,2,1,1,4,1,2,1,2,1,2,1,8,4,3,4
569693,7945171,W92000004,H,2,1,2,4,2,2,1,1,1,2,1,4,4,2,2
569706,7946284,W92000004,H,1,1,2,4,1,2,1,1,1,3,1,3,11,3,2
569725,7945073,W92000004,H,1,1,2,4,1,2,1,1,1,2,1,4,11,3,2


We can also easily translate the cryptic keys into descriptive strings.

In [9]:
df["Residence Type"].apply(md.ResidenceOptions.parse)

0         Not resident in a communal establishment
1         Not resident in a communal establishment
2         Not resident in a communal establishment
3         Not resident in a communal establishment
4         Not resident in a communal establishment
                            ...                   
569736    Not resident in a communal establishment
569737    Not resident in a communal establishment
569738    Not resident in a communal establishment
569739    Not resident in a communal establishment
569740    Not resident in a communal establishment
Name: Residence Type, Length: 569741, dtype: object
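
The source of `OptionEnum` is not included in this notebook. Judging only by the calls used above (`key()`, `desc()` and `parse`), a class with that behaviour might look like the minimal sketch below. The member names and the description for key `C` are assumptions made for illustration.

```python
from enum import Enum


class OptionEnum(Enum):
    """Enum whose member values are (key, description) pairs."""

    def key(self):
        return self.value[0]

    def desc(self):
        return self.value[1]

    @classmethod
    def parse(cls, key):
        """Translate a raw key from the data set into its description."""
        for option in cls:
            if option.key() == key:
                return option.desc()
        raise ValueError(f"{key!r} is not a valid key for {cls.__name__}")


class ResidenceOptions(OptionEnum):
    # Hypothetical member names; the "C" description is an assumption.
    COMMUNAL = ("C", "Resident in a communal establishment")
    NOT_COMMUNAL = ("H", "Not resident in a communal establishment")


labels = [f"{x.key()}: {x.desc()}" for x in ResidenceOptions]
```

Because `parse` raises `ValueError` for an unknown key, applying it to a column also acts as a value check: any code missing from the enum stops the translation instead of passing through silently.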