## DATA CLEANING PROCESS
 1. Lowercase all current column names
 2. Rename columns using the new mapping
 3. Remove numbered prefixes and standardise sentence length elements in dataframe with regex
 4. Reorder columns
 5. Convert certain columns to `category` type
 6. Standardise the case of `outcome` values
 7. Filter the dataset for output
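
Step 3 can be illustrated on a toy Series. This is a minimal sketch: the sample strings are partly made up for illustration (they are not a sample of the real dataset), and only two of the sentence-length patterns are shown.

```python
import pandas as pd

# Toy values; 'Over 2 years' and 'Life' are illustrative assumptions
toy = pd.Series(['01: Female', '24:Not known', '10: Over 2 years', 'Life'])

# Strip a leading "<digits>:" prefix, then any whitespace left behind
no_prefix = toy.str.replace(r'^\d+:', '', regex=True).str.lstrip()

# Apply regex -> replacement pairs in one call
cleaned = no_prefix.replace(regex={r'(Over)': 'More than', r'Life$': 'Life sentence'})

print(cleaned.tolist())
```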

In [1]:
#Importing libraries
import pandas as pd
import glob

I'm going to avoid changing the current `utilities.py` functions and instead re-define them here for testing first.

In [None]:
def loadData():
    path="data/external/obo_sent_pivot_2010_2022/"
    cols = ['Police Force Area', 'Year', 'Sex', 'Age group', 'Offence group', 'Sentence Outcome', 'Custodial Sentence Length','Sentenced']
    all_files = glob.glob(path + "*.csv")
    all_csvs = [pd.read_csv(filename, usecols=cols, encoding= 'unicode_escape', low_memory=False) for filename in all_files]
    return pd.concat(all_csvs, axis=0, ignore_index=True)

In [None]:
df=loadData()
df.head()

In [None]:
def lcColumns(x_df):
    '''
    Converting all `x_df` columns to lowercase and replacing spaces with underscores

    Parameters
    ----------
    x_df: Pandas dataframe
    '''
    x_df.columns = x_df.columns.str.lower().str.replace(' ', '_')
    return x_df

In [None]:
my_df = df.copy()
df_cleaned=(
    my_df
    .pipe(lcColumns)
)
df_cleaned.head()

In [None]:
def renameColumns(x_df):
    mapping={
    'offence_group': 'offence',
    'police_force_area': 'pfa',
    'sentence_outcome': 'outcome',
    'custodial_sentence_length': 'sentence_length',
    'sentenced': 'freq'
    }
    
    x_df = x_df.rename(columns=mapping)
    return x_df
    

In [None]:
my_df = df.copy()
df_cleaned=(
    my_df
    .pipe(lcColumns)
    .pipe(renameColumns)
)
df_cleaned.head()

In [None]:
def removePrefix(x_df):
    """
    Remove numbered prefixes from all elements in dataframe

    Parameters
    ----------
    x_df: Pandas dataframe

    Returns
    -------
    Dataframe
        Dataframe with regex parameters replaced
    """
    cols = x_df.select_dtypes(include='object').columns
    for col in cols:
        x_df[col] = x_df[col].str.replace(r'^\d+:', '', regex=True).str.lstrip()
    return x_df

In [None]:
my_df = df.copy()
df_cleaned=(
    my_df
    .pipe(lcColumns)
    .pipe(renameColumns)
    .pipe(removePrefix)
)
df_cleaned.head()

In [None]:
def standardiseSentences(x_df, col):
    mapping = {r"^\S* - ": "",
        "(Over)": "More than",
        "(to less than)": "and under",
        "Life$": "Life sentence",
        }
    x_df[col] = x_df[col].replace(regex=mapping)
    return x_df

In [None]:
my_df = df.copy()
df_cleaned=(
    my_df
    .pipe(lcColumns)
    .pipe(renameColumns)
    .pipe(removePrefix)
    .pipe(standardiseSentences, 'sentence_length')
)
df_cleaned.head()

In [None]:
def orderColumns(x_df):
    '''
    Set column order for dataframe

    Parameters
    ----------
    x_df: Pandas dataframe
    '''
    column_order = ['year', 'pfa', 'sex', 'age_group', 'offence', 'outcome', 'sentence_length', 'freq']

    x_df = x_df.reindex(columns=column_order)
    return x_df

In [None]:
def removeTotal(x_df, col):
    x_df[col] = x_df[col].str.replace("Total ", "").str.capitalize()
    return x_df

In [None]:
my_df = df.copy()
df_cleaned=(
    my_df
    .pipe(lcColumns)
    .pipe(renameColumns)
    .pipe(removePrefix)
    .pipe(standardiseSentences, 'sentence_length')
    .pipe(removeTotal, 'outcome')
    .pipe(orderColumns)
)
df_cleaned.head()

In [None]:
df_cleaned.info()

In [None]:
def categoryColumns(x_df):
    cols = x_df.select_dtypes(include='object').columns
    for col in cols:
        ratio = len(x_df[col].value_counts()) / len(x_df)
        if ratio < 0.05:
            x_df[col] = x_df[col].astype('category')
    return x_df

In [None]:
my_df = df.copy()
df_cleaned=(
    my_df
    .pipe(lcColumns)
    .pipe(renameColumns)
    .pipe(removePrefix)
    .pipe(standardiseSentences, 'sentence_length')
    .pipe(removeTotal, 'outcome')
    .pipe(orderColumns)
    .pipe(categoryColumns)
)
df_cleaned.head()

In [None]:
df_cleaned.info()

Given how much memory usage drops once the `object` columns are converted, I might move the `categoryColumns()` step earlier in the pipeline. Let's see if it makes a difference to run time.
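
One way to compare orderings is to time them directly. The sketch below is an assumption-laden illustration: `time_pipeline`, `make_toy`, `to_category` and `strip_prefix` are hypothetical stand-ins defined on a toy frame, not the functions from this notebook, and the timings will vary by machine.

```python
import time
import pandas as pd

def time_pipeline(make_frame, steps, repeats=3):
    """Return the best wall-clock time in seconds for running `steps` in order."""
    best = float('inf')
    for _ in range(repeats):
        frame = make_frame()  # fresh input for every run
        start = time.perf_counter()
        for step in steps:
            frame = frame.pipe(step)
        best = min(best, time.perf_counter() - start)
    return best

def make_toy():
    # Toy data only; the real frame is far larger
    return pd.DataFrame({'Sex': ['01: Female', '02: Male'] * 5000})

def to_category(x_df):
    return x_df.astype({'Sex': 'category'})

def strip_prefix(x_df):
    return x_df.assign(Sex=x_df['Sex'].str.replace(r'^\d+:', '', regex=True).str.lstrip())

t_cat_first = time_pipeline(make_toy, [to_category, strip_prefix])
t_cat_last = time_pipeline(make_toy, [strip_prefix, to_category])
print(f'category first: {t_cat_first:.4f}s, category last: {t_cat_last:.4f}s')
```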

In [None]:
my_df = df.copy()
df_cleaned=(
    my_df
    .pipe(categoryColumns)
    .pipe(lcColumns)
    .pipe(renameColumns)
    .pipe(removePrefix)
    .pipe(standardiseSentences, 'sentence_length')
    .pipe(removeTotal, 'outcome')
    .pipe(orderColumns)
)
df_cleaned.head()

Right, that didn't quite work! Looking back, it appears that `removePrefix()` is the issue, as it only selects columns of `object` type. Let's update this and run again.
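
A small sketch of why columns already converted to `category` get skipped, using a toy frame with explicitly set dtypes (the frame and its values are for illustration only):

```python
import pandas as pd

toy = pd.DataFrame({
    'sex': pd.Series(['01: Female', '02: Male'], dtype=object),
    'outcome': pd.Series(['05: Suspended Sentence'] * 2, dtype='category'),
})

# select_dtypes only returns columns whose dtype matches the request
object_cols = list(toy.select_dtypes(include='object').columns)
category_cols = list(toy.select_dtypes(include='category').columns)

# Passing a list selects both kinds at once
both = list(toy.select_dtypes(include=['object', 'category']).columns)

print(object_cols, category_cols, both)
```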

In [None]:
def removePrefix(x_df):
    """
    Remove numbered prefixes from all elements in dataframe

    Parameters
    ----------
    x_df: Pandas dataframe

    Returns
    -------
    Dataframe
        Dataframe with regex parameters replaced
    """
    cols = x_df.select_dtypes(include='category').columns
    for col in cols:
        x_df[col] = x_df[col].str.replace(r'^\d+:', '', regex=True).str.lstrip()
    return x_df

In [None]:
my_df = df.copy()
df_cleaned=(
    my_df
    .pipe(categoryColumns)
    .pipe(lcColumns)
    .pipe(renameColumns)
    .pipe(removePrefix)
    .pipe(standardiseSentences, 'sentence_length')
    .pipe(removeTotal, 'outcome')
    .pipe(orderColumns)
)
df_cleaned.head()

Much quicker!

I'm also going to loop back up and combine the next stage of `str.capitalize()` for the `outcome` column into the `removeTotal()` function.

I also suspect that this whole operation can be sped up by filtering the dataset as the first step. So let's define this next and then reorder the `pipe()` functions.

In [None]:
df.columns

In [None]:
df['Police Force Area'].unique()

In [None]:
def filterDataFrame(x_df):
    filt1 = x_df['Sex'] == '01: Female'
    filt2 = x_df['Sentence Outcome'].isin(['06: Total Immediate Custody', '04: Total Community sentence','05: Suspended Sentence'])
    filt3 = x_df['Age group'].isin(['02: Young adults', '03: Adults'])
    filt4 = x_df['Police Force Area'].isin(['City of London', 'Not known'])
    filt = filt1 & filt2 & filt3 & ~filt4
    women_dataset = x_df[filt].sort_values(['Year', 'Police Force Area']).copy()
    return women_dataset

In [None]:
my_df = df.copy()
df_cleaned=(
    my_df
    .pipe(filterDataFrame)
    .pipe(categoryColumns)
    .pipe(lcColumns)
    .pipe(renameColumns)
    .pipe(removePrefix)
    .pipe(standardiseSentences, 'sentence_length')
    .pipe(removeTotal, 'outcome')
    .pipe(orderColumns)
)
df_cleaned.head()

Wow, that was quick! I'm going to need to double check this has completed all of the stages.

What has been completed?
* It's filtered
* Columns are lowercase
* Columns are renamed
* Prefixes are removed
* Total has been removed from `outcome`
* Column order has been changed

What needs further checking?
* Have `object` columns been changed to `category`?
* Have `sentence_length` values been standardised?

In [None]:
df_cleaned.info()

Hmmmmm, no to the first question then!

What's gone on here?

Let's raise the `ratio` threshold in the `categoryColumns` function and see if that makes a difference.

In [None]:
def categoryColumns(x_df):
    cols = x_df.select_dtypes(include='object').columns
    for col in cols:
        ratio = len(x_df[col].value_counts()) / len(x_df)
        if ratio < 0.5:
            x_df[col] = x_df[col].astype('category')
    return x_df

In [None]:
my_df = df.copy()
df_cleaned=(
    my_df
    .pipe(filterDataFrame)
    .pipe(categoryColumns)
    .pipe(lcColumns)
    .pipe(renameColumns)
    .pipe(removePrefix)
    .pipe(standardiseSentences, 'sentence_length')
    .pipe(removeTotal, 'outcome')
    .pipe(orderColumns)
)
df_cleaned.head()

In [None]:
df_cleaned.info()

Nope, that hasn't made a difference.

In [None]:
def filterDataFrame(x_df):
    filt1 = x_df['Sex'] == '01: Female'
    filt2 = x_df['Sentence Outcome'].isin(['06: Total Immediate Custody', '04: Total Community sentence','05: Suspended Sentence'])
    filt3 = x_df['Age group'].isin(['02: Young adults', '03: Adults'])
    filt4 = x_df['Police Force Area'].isin(['City of London', 'Not known'])
    filt = filt1 & filt2 & filt3 & ~filt4
    return x_df[filt].sort_values(['Year', 'Police Force Area'])

Let's run this again with a slight change to the return statement of `filterDataFrame()`

In [None]:
my_df = df.copy()
df_cleaned=(
    my_df
    .pipe(filterDataFrame)
    .pipe(categoryColumns)
    .pipe(lcColumns)
    .pipe(renameColumns)
    .pipe(removePrefix)
    .pipe(standardiseSentences, 'sentence_length')
    .pipe(removeTotal, 'outcome')
    .pipe(orderColumns)
)
df_cleaned.head()

In [None]:
def categoryColumns(x_df):
    cols = x_df.select_dtypes(include='object').columns
    for col in cols:
        ratio = len(x_df[col].value_counts()) / len(x_df)
        if ratio < 0.05:
            x_df[col] = x_df[col].astype('category')
    return x_df

In [None]:
cols = df_cleaned.select_dtypes(include='object').columns
for col in cols:
    ratio = len(df_cleaned[col].value_counts()) / len(df_cleaned)
    print(f'Column: {col}, Ratio: {ratio:.20f}')

In [None]:
cols = df.select_dtypes(include='object').columns
for col in cols:
    ratio = len(df[col].value_counts()) / len(df)
    print(f'Column: {col}, Ratio: {ratio:.20f}')

In [None]:
my_df = df.copy()
df_cleaned=(
    my_df
    .pipe(filterDataFrame)
    .pipe(categoryColumns)
    .pipe(lcColumns)
    .pipe(renameColumns)
    # The categorised columns revert to `object` at the next step, because `.str.replace()` returns plain strings
    .pipe(removePrefix)
)
df_cleaned.info()

Okay, so let's adjust the `removePrefix()` function and change some of the ordering so `categoryColumns()` comes after.
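
The dtype change can be reproduced on a toy Series. This is a minimal sketch; the `rename_categories` variant at the end is an alternative not used in this notebook, shown only to illustrate that editing the category labels themselves keeps the column categorical.

```python
import re
import pandas as pd

cat = pd.Series(['01: Female', '01: Female', '02: Male'], dtype='category')

# .str methods on a categorical return an ordinary (non-categorical) Series
stripped = cat.str.replace(r'^\d+:', '', regex=True).str.lstrip()
is_cat_after = isinstance(stripped.dtype, pd.CategoricalDtype)

# Alternative (not the approach taken here): rewrite the category labels in place
renamed = cat.cat.rename_categories(lambda c: re.sub(r'^\d+:\s*', '', c))

print(is_cat_after, renamed.dtype)
```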

In [None]:
def removePrefix(x_df):
    """
    Remove numbered prefixes from all elements in dataframe

    Parameters
    ----------
    x_df: Pandas dataframe

    Returns
    -------
    Dataframe
        Dataframe with regex parameters replaced
    """
    cols = x_df.select_dtypes(include='object').columns
    for col in cols:
        x_df[col] = x_df[col].str.replace(r'^\d+:', '', regex=True).str.lstrip()
    return x_df

In [None]:
my_df = df.copy()
df_cleaned=(
    my_df
    .pipe(filterDataFrame)
    .pipe(lcColumns)
    .pipe(renameColumns)
    .pipe(removePrefix)
    .pipe(removeTotal, 'outcome')
    .pipe(categoryColumns)
    .pipe(standardiseSentences, 'sentence_length')
    .pipe(orderColumns)
    
)
df_cleaned.info()

Right, I think we're there. Let's consolidate.

In [None]:
df_cleaned = (
    df.copy()
    .pipe(filterDataFrame)
    .pipe(lcColumns)
    .pipe(renameColumns)
    .pipe(removePrefix)
    .pipe(removeTotal, 'outcome')
    .pipe(categoryColumns)
    .pipe(standardiseSentences, 'sentence_length')
    .pipe(orderColumns)
)

In [2]:
def loadData():
    path="data/external/obo_sent_pivot_2010_2022/"
    cols = ['Police Force Area', 'Year', 'Sex', 'Age group', 'Offence group', 'Sentence Outcome', 'Custodial Sentence Length','Sentenced']
    all_files = glob.glob(path + "*.csv")
    all_csvs = [pd.read_csv(filename, usecols=cols, encoding= 'unicode_escape', low_memory=False) for filename in all_files]
    return pd.concat(all_csvs, axis=0, ignore_index=True)

def filterDataFrame(x_df):
    filt1 = x_df['Sex'] == '01: Female'
    filt2 = x_df['Sentence Outcome'].isin(['06: Total Immediate Custody', '04: Total Community sentence','05: Suspended Sentence'])
    filt3 = x_df['Age group'].isin(['02: Young adults', '03: Adults'])
    filt4 = x_df['Police Force Area'].isin(['City of London', 'Not known'])
    filt = filt1 & filt2 & filt3 & ~filt4
    return x_df[filt].sort_values(['Year', 'Police Force Area'])

def lcColumns(x_df):
    '''
    Converting all `x_df` columns to lowercase and replacing spaces with underscores

    Parameters
    ----------
    x_df: Pandas dataframe
    '''
    x_df.columns = x_df.columns.str.lower().str.replace(' ', '_')
    return x_df

def renameColumns(x_df):
    mapping={
    'offence_group': 'offence',
    'police_force_area': 'pfa',
    'sentence_outcome': 'outcome',
    'custodial_sentence_length': 'sentence_length',
    'sentenced': 'freq'
    }
    
    x_df = x_df.rename(columns=mapping)
    return x_df

def removePrefix(x_df):
    """
    Remove numbered prefixes from all elements in dataframe

    Parameters
    ----------
    x_df: Pandas dataframe

    Returns
    -------
    Dataframe
        Dataframe with regex parameters replaced
    """
    cols = x_df.select_dtypes(include='object').columns
    for col in cols:
        x_df[col] = x_df[col].str.replace(r'^\d+:', '', regex=True).str.lstrip()
    return x_df
    
def removeTotal(x_df, col):
    x_df[col] = x_df[col].str.replace("Total ", "").str.capitalize()
    return x_df

def categoryColumns(x_df):
    cols = x_df.select_dtypes(include='object').columns
    for col in cols:
        ratio = len(x_df[col].value_counts()) / len(x_df)
        if ratio < 0.05:
            x_df[col] = x_df[col].astype('category')
    return x_df

def standardiseSentences(x_df, col):
    mapping = {r"^\S* - ": "",
        "(Over)": "More than",
        "(to less than)": "and under",
        "Life$": "Life sentence",
        }
    x_df[col] = x_df[col].replace(regex=mapping)
    return x_df

def orderColumns(x_df):
    '''
    Set column order for dataframe

    Parameters
    ----------
    x_df: Pandas dataframe
    '''
    column_order = ['year', 'pfa', 'sex', 'age_group', 'offence', 'outcome', 'sentence_length', 'freq']

    x_df = x_df.reindex(columns=column_order)
    return x_df

In [5]:
df=loadData()

my_df = df.copy()
df_cleaned=(
    my_df
    .pipe(filterDataFrame)
    .pipe(lcColumns)
    .pipe(renameColumns)
    .pipe(removePrefix)
    .pipe(removeTotal, 'outcome')
    .pipe(standardiseSentences, 'sentence_length')
    .pipe(categoryColumns)
    .pipe(orderColumns)   
)
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 243344 entries, 1915 to 2533630
Data columns (total 8 columns):
 #   Column           Non-Null Count   Dtype   
---  ------           --------------   -----   
 0   year             243344 non-null  int64   
 1   pfa              243344 non-null  category
 2   sex              243344 non-null  category
 3   age_group        243344 non-null  category
 4   offence          243344 non-null  category
 5   outcome          243344 non-null  category
 6   sentence_length  243344 non-null  category
 7   freq             243344 non-null  int64   
dtypes: category(6), int64(2)
memory usage: 7.0 MB


In [6]:
df_cleaned.head()

Unnamed: 0,year,pfa,sex,age_group,offence,outcome,sentence_length,freq
1915,2010,Avon and Somerset,Female,Young adults,Violence against the person,Community sentence,Not known,1
1997,2010,Avon and Somerset,Female,Young adults,Drug offences,Community sentence,Not known,1
2052,2010,Avon and Somerset,Female,Young adults,Violence against the person,Immediate custody,Life sentence,1
2053,2010,Avon and Somerset,Female,Young adults,Violence against the person,Community sentence,Not known,1
2054,2010,Avon and Somerset,Female,Young adults,Violence against the person,Suspended sentence,Not known,1


In [7]:
df_cleaned.to_csv('data/interim/new_pipeline.csv', index=False)

In [8]:
df_cleaned_import = pd.read_csv('data/interim/new_pipeline.csv')
df_cleaned_import

Unnamed: 0,year,pfa,sex,age_group,offence,outcome,sentence_length,freq
0,2010,Avon and Somerset,Female,Young adults,Violence against the person,Community sentence,Not known,1
1,2010,Avon and Somerset,Female,Young adults,Drug offences,Community sentence,Not known,1
2,2010,Avon and Somerset,Female,Young adults,Violence against the person,Immediate custody,Life sentence,1
3,2010,Avon and Somerset,Female,Young adults,Violence against the person,Community sentence,Not known,1
4,2010,Avon and Somerset,Female,Young adults,Violence against the person,Suspended sentence,Not known,1
...,...,...,...,...,...,...,...,...
243339,2022,Wiltshire,Female,Adults,Fraud Offences,Suspended sentence,Not known,1
243340,2022,Wiltshire,Female,Adults,Summary non-motoring,Community sentence,Not known,1
243341,2022,Wiltshire,Female,Adults,Summary motoring,Community sentence,Not known,1
243342,2022,Wiltshire,Female,Adults,Miscellaneous crimes against society,Community sentence,Not known,1


In [9]:
df_original = pd.read_csv('data/interim/PFA_2010-22_women_cust_comm_sus.csv')
df_original

Unnamed: 0,year,pfa,sex,age_group,offence,outcome,sentence_length,freq
0,2010,Avon and Somerset,Female,Young adults,Violence against the person,Community sentence,24:Not known,1
1,2010,Avon and Somerset,Female,Young adults,Drug offences,Community sentence,24:Not known,1
2,2010,Avon and Somerset,Female,Young adults,Violence against the person,Immediate custody,Life sentence,1
3,2010,Avon and Somerset,Female,Young adults,Violence against the person,Community sentence,24:Not known,1
4,2010,Avon and Somerset,Female,Young adults,Violence against the person,Suspended sentence,24:Not known,1
...,...,...,...,...,...,...,...,...
243339,2022,Wiltshire,Female,Adults,Fraud Offences,Suspended sentence,24:Not known,1
243340,2022,Wiltshire,Female,Adults,Summary non-motoring,Community sentence,24:Not known,1
243341,2022,Wiltshire,Female,Adults,Summary motoring,Community sentence,24:Not known,1
243342,2022,Wiltshire,Female,Adults,Miscellaneous crimes against society,Community sentence,24:Not known,1


Great, that looks fine. Now to replace elements in `utilities.py` and `data_cleansing.py`.
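
The two previews above do differ in `sentence_length` (`Not known` versus `24:Not known`), so a column-by-column comparison may be worth running alongside the visual check. A minimal sketch, where `columns_that_differ` is a hypothetical helper and the two small frames reuse a couple of values from the printed output:

```python
import pandas as pd

def columns_that_differ(left, right):
    """List the columns of `left` whose values are not identical in `right`."""
    return [col for col in left.columns if not left[col].equals(right[col])]

new = pd.DataFrame({
    'outcome': ['Community sentence', 'Immediate custody'],
    'sentence_length': ['Not known', 'Life sentence'],
})
old = pd.DataFrame({
    'outcome': ['Community sentence', 'Immediate custody'],
    'sentence_length': ['24:Not known', 'Life sentence'],
})

differing = columns_that_differ(new, old)
print(differing)
```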