# Household factors leading to conviction of a crime

In [50]:
# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.utils import resample
from statsmodels.stats.outliers_influence import variance_inflation_factor
import warnings
warnings.filterwarnings('ignore')
from IPython.display import clear_output

## Introduction
The general level of crime, as estimated by the Victims of Crime Survey (VOCS), had been declining over the past five years but increased in 2016/17 and 2017/18.

Household crimes increased by 5% to a total of 1,5 million incidents, while individual crime also increased by 5% to a total of 1,6 million incidents, affecting 1,4 million individuals aged 16 and above. Northern Cape had the highest increase in both household and individual crimes. Housebreaking or burglary was the dominant crime category (54%) among the crimes measured by the VOCS. An estimated 830 thousand incidents of housebreaking occurred in 2017/18, affecting 4,25% of all South African households. Nearly 32% of items stolen during housebreaking were clothes, followed by cellphones (24%) and food (22%).


### Research question

What variables affect conviction of a crime and how do they affect it?



### Hypothesis

This project uses data from Stats SA to predict the number of crime convictions from household data (household income, distance to the nearest police station, living conditions, etc.).

1) Determine the relationship between the number of crime convictions (conviction of theft of personal property, conviction of fraud, conviction of assault, etc.) and household living conditions (type of toilet facility, access to/use of electricity, being a victim of other crimes, etc.). Does having no access to electricity lead a person to commit more crimes? Does a conviction of assault result from being a victim of assault?

2) Determine whether there is a statistical relationship between any crime conviction and a person's living conditions, place of dwelling and overall community development. Apply statistical modelling methods to estimate that relationship.


### Methodology
A brief overview of the methodology is as follows:<br/>
1. Read in data <br/>
2. Encode variables <br/>
3. Exploratory plots <br/>
4. Initial modelling <br/>
5. Feature selection using step-wise regression <br/>
6. Bootstrapping for feature selection and testing<br/>
7. Comparison of regression coefficients over the years <br/>
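
Step 6 relies on bootstrapping: resample the rows with replacement, refit the model, and look at how much each coefficient varies across resamples. The block below is a minimal sketch of that idea on synthetic data, not the notebook's own implementation. The function name `bootstrap_slope`, the synthetic `x` and `y`, and the use of plain NumPy least squares instead of `sklearn.utils.resample` and statsmodels are all assumptions made for illustration.

```python
import numpy as np

def bootstrap_slope(x, y, n_boot=500, seed=0):
    """Return bootstrap estimates of the OLS slope of y on x."""
    rng = np.random.default_rng(seed)
    n = len(x)
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)            # resample rows with replacement
        X = np.column_stack([np.ones(n), x[idx]])   # intercept column + predictor
        coef, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
        slopes[b] = coef[1]                         # keep the slope only
    return slopes

# Synthetic data (not the survey): the true slope is 2.0
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200)

slopes = bootstrap_slope(x, y)
lower, upper = np.percentile(slopes, [2.5, 97.5])   # percentile interval
print(f"slope ~ {slopes.mean():.2f}, 95% interval [{lower:.2f}, {upper:.2f}]")
```

A coefficient whose bootstrap interval stays clearly away from zero is a more stable candidate feature than one whose sign flips between resamples.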

# 1. Read in the data

#### Link to dataset: http://nesstar.statssa.gov.za:8282/webview/

* Collection method: Survey of 23380 households across all 9 provinces
* Date collected: April 10, 2017
* Date Downloaded: April 07, 2021
* Data size: 23380 rows, 307 columns

#### Q116PRESENT - Present household living standard
1 - Wealthy<br/>
2 - Very comfortable<br/>
3 - Reasonably comfortable<br/>
4 - Just getting along<br/>
5 - Poor<br/>
6 - Very poor<br/>
9 - Unspecified<br/>

#### Q51MAIN - Description of the main building

1 - Formal dwelling/house or brick/concrete block structure on a separate stand or yard or on a farm<br/>
2 - Traditional dwelling/hut/structure made of traditional materials<br/>
3 - Flat or apartment in a block of flats<br/>
4 - Cluster house in security complex<br/>
5 - Town house (semi-detached house in a complex)<br/>
6 - Semi-detached house<br/>
7 - Formal dwelling/house/flat/room in backyard<br/>
8 - Informal dwelling/shack in backyard<br/>
9 - Informal dwelling/shack not in backyard, e.<br/>

#### Q51NOOTH - No other dwelling occupied

1 - Yes <br/>
8 - Not applicable <br/>
9 - Unspecified <br/>

#### Q52AWALLS - Main material used for walls of the main dwelling

1 - Bricks <br/> 
2 - Cement/concrete <br/> 
3 - Corrugated iron/Zinc <br/>
4 - Wood <br/> 
5 - Plastic <br/> 
6 - Cardboard <br/> 
7 - Mixture of mud and cement <br/> 
8 - Wattle and daub (e.g. sticks and mud) <br/> 
9 - Tiles <br/> 
10 - Mud <br/> 
11 - Thatch/grass<br/> 
12 - Asbestos <br/> 
13 - Other, specify <br/> 
99 - Unspecified <br/>

#### Q52AROOF - Main material used for the roof of the main dwelling

1 -  Bricks <br/>
2 - Cement/concrete <br/> 
3 - Corrugated iron/Zinc <br/>
4 - Wood <br/> 
5 - Plastic <br/>
6 - Cardboard <br/>
7 - Mixture of mud and cement <br/>
8 - Wattle and daub (e.g. sticks and mud) <br/>
9 - Tiles <br/>
10 - Mud <br/>
11 - Thatch/grass <br/>
12 - Asbestos <br/>
13 - Other, specify <br/>
99 - Unspecified <br/>

#### Q54ADWELLING - Government housing subsidy for other dwelling

1 - Yes <br/>
2 - No <br/> 
3 - Don’t know <br/>
9 - Unspecified <br/>
	
#### Q54AOTHER - Government housing subsidy for other dwelling

1 - Yes <br/>
2 - No <br/> 
3 - Don’t know <br/>
9 - Unspecified<br/>

#### Q56DRINK - Main source of water for drinking

1 - Piped (tap) water in the dwelling/house <br/>
2 - Piped (tap) water in yard  <br/>
3 - Borehole in yard  <br/>
4 - Rain-water tank in yard  <br/>
5 - Neighbour’s tap  <br/>
6 - Public/communal tap  <br/>
7 - Water-carrier/tanker <br/>
8 - Borehole outside yard  <br/> 
9 - Flowing water/stream/river  <br/>
10 - Stagnant water/dam/pool  <br/>
11 - Well  <br/>
12 - Spring  <br/>
13 - Other, specify  <br/>
99 - Unspecified <br/>

#### Q56OTHER - Main source of water for other use

1 - Piped (tap) water in the dwelling/house <br/>
2 - Piped (tap) water in yard <br/>
3 - Borehole in yard <br/>
4 - Rain-water tank in yard <br/>
5 - Neighbour’s tap <br/>
6 - Public/communal tap <br/>
7 - Water-carrier/tanker <br/>
8 - Borehole outside yard<br/>
9 - Flowing water/stream/river <br/>
10 - Stagnant water/dam/pool <br/>
11 - Well <br/>
12 - Spring <br/> 
13 - Other, specify <br/>
99 - Unspecified <br/>

#### Q518TOILET - Type of toilet facility

1 - Flush toilet connected to a public sewerage system <br/>
2 - Flush toilet connected to a septic tank <br/> 
3 - Chemical toilet <br/> 
4 - Pit latrine/toilet with ventilation pipe <br/> 
5 - Pit latrine/toilet without ventilation pipe <br/> 
6 - Bucket toilet (collected by municipality) <br/> 
7 - Bucket toilet (emptied by household) <br/>
8 - Ecological Sanitation Systems <br/>
9 - None <br/> 
10 - Other, specify <br/>
99 - Unspecified<br/>



#### Q524ELECT - Access to/use electricity

1 - Yes <br/>
2 - No <br/>
9 - Unspecified <br/>

#### Q531POLICE - Distance to the nearest police station

1 - Less than 500m <br/>
2 - 500m – less than 1km <br/>
3 - 1km – less than 2km <br/>
4 - 2km – less than 5km <br/>
5 - 5km – less than 10km <br/>
6 - 10km – less than 20km <br/>
7 - 20km or more <br/>
8 - Not available <br/>
9 - Don’t know <br/>
99 - Unspecified <br/>

#### Q532FOOD - Mode of transport to the nearest food market

1 - Walking <br/>
2 - Taxi <br/> 
3 - Bus (public)<br/>
4 - Train <br/>
5 - Own transport <br/>
6 - Other, specify <br/>
9 - Unspecified<br/>

#### Q62OWNSHIP - Ownership of main dwelling

1 - Owned and fully paid off <br/>
2 - Owned, but not yet fully paid off, financed by a mortgage bond <br/>
3 - Owned, but not yet fully paid off, financed by another type of loan <br/>
4 - Rented as part of employment contract of household member <br/>
5 - Rented not as part of employment contract of household member <br/>
6 - Occupied rent-free as part of employment contract of household member <br/>
7 - Occupied rent-free not<br/>

#### Q651TOTROOMS - Total number of rooms
99 - Unspecified 

#### Q658VALUE - Estimated value of the dwelling

1 - R0 – R5 000 <br/>
2 - R5 001 – R10 000 <br/>
3 - R10 001 – R20 000 <br/>
4 - R20 001 – R50 000 <br/>
5 - R50 001 – R100 000 <br/>
6 - R100 001 – R250 000 <br/>
7 - R250 001 – R500 000 <br/>
8 - R500 001 – R1 000 000 <br/>
9 - R1 000 001 – R2 000 000 <br/>
10 - R2 000 001 – R3 000 000 <br/>
11 - R3 000 001 – R4 000 000 <br/>
12 - More than R4 000 000 <br/>
13 - Don’t know <br/>
88 - Not applicable <br/>
99 - Unspecified


#### Q61035POLICE - Police on the streets in the local area

1 - Have <br/>
2 - Don't have <br/>
3 - Don't Know <br/>
9 - Unspecified

#### Q224BNOMONEY5 - No money for 5 or more days

1 - Yes <br/>
2 - No <br/>
9 - Unspecified

#### Q2341SAVING - Household Savings

1 - Yes <br/>
2 - No <br/>
9 - Unspecified


#### province_code -  Province Code

1 - Western Cape <br/>
2 - Eastern Cape <br/>
3 - Northern Cape <br/>
4 - Free State <br/>
5 - KwaZulu-Natal <br/>
6 - North West <br/>
7 - Gauteng <br/>
8 - Mpumalanga <br/>
9 - Limpopo<br/>


#### SETTLEMENT_TYPE - Settlement Type

1 - Urban formal <br/>
2 - Urban informal <br/>
4 - Traditional area <br/>
5 - Rural formal<br/>
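
The numeric codes listed above are labels, not quantities, so it can help to decode them before plotting or tabulating. The block below is a minimal sketch using `Series.map` on a toy frame. The lookup dictionaries copy the province and settlement listings above, while the toy rows and the new column names `province_name` and `settlement_name` are assumptions for illustration.

```python
import pandas as pd

# Code-to-label lookups copied from the codebook listings above
PROVINCE_LABELS = {
    1: "Western Cape", 2: "Eastern Cape", 3: "Northern Cape",
    4: "Free State", 5: "KwaZulu-Natal", 6: "North West",
    7: "Gauteng", 8: "Mpumalanga", 9: "Limpopo",
}
SETTLEMENT_LABELS = {
    1: "Urban formal", 2: "Urban informal",
    4: "Traditional area", 5: "Rural formal",
}

# Toy rows standing in for the survey data (values are made up)
toy = pd.DataFrame({"province_code": [7, 1, 9], "SETTLEMENT_TYPE": [1, 2, 4]})

# Series.map leaves any code that is missing from the lookup as NaN
toy["province_name"] = toy["province_code"].map(PROVINCE_LABELS)
toy["settlement_name"] = toy["SETTLEMENT_TYPE"].map(SETTLEMENT_LABELS)
print(toy)
```

Because unmapped codes come back as NaN, the same pattern doubles as a quick check for codes that are not in the codebook.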



In [60]:
DF = pd.read_csv('LCS-2014-2015-HOUSEHOLD/LCS-2014-2015-HOUSEHOLD_F1.csv')
DF.head()

Unnamed: 0,UQNO,SURVEYDATE,Q11CPARTHH,Q11CMANY,Q15OTHPERS,Q116PRESENT,Q116PERSONNO,Q51MAIN,Q51OTHER,Q51NOOTH,...,income_pcp_quintile,expend_inkind_decile,expend_inkind_quintile,income_inkind_decile,income_inkind_quintile,Expenditure_weighted,Expenditure_inkind_weighted,Income_weighted,Income_inkind_weighted,hholds_wgt
0,813004940000021702,1032015,2,88,2,5,1,2,88,1,...,1,1,1,1,1,1516301.0,1516301.0,0.0,0.0,624.763809
1,607001710000011901,2042015,2,88,2,6,1,1,88,1,...,1,1,1,1,1,2596009.0,2596009.0,0.0,0.0,509.986828
2,774017360000010201,4102015,2,88,2,6,1,7,88,1,...,1,1,1,1,1,5079606.0,5079606.0,0.0,0.0,828.185078
3,236006020000000901,1022015,2,88,2,6,1,2,2,8,...,1,1,1,1,1,2105227.0,2105227.0,0.0,0.0,341.725292
4,773024020070002701,3072015,1,99,2,5,2,1,99,9,...,1,1,1,1,1,6038148.0,6038148.0,0.0,0.0,928.854284


In [77]:
# Select the columns used in the analysis
DATA = DF[['Q116PRESENT', 'Q51MAIN', 'Q51NOOTH', 'Q52AWALLS', 'Q52AROOF', 'Q53WALLS',
           'Q53ROOF', 'Q54ADWELLING', 'Q161CPUBATT', 'Q161CPRIVATT', 'Q54AOTHER',
           'Q56DRINK', 'Q56OTHER', 'Q518TOILET', 'Q524ELECT', 'Q531POLICE', 'Q532FOOD',
           'Q62OWNSHIP', 'Q651TOTROOMS', 'Q658VALUE', 'Q61035POLICE', 'Q224BNOMONEY5',
           'Q2341SAVING', 'province_code', 'SETTLEMENT_TYPE', 'Ageofhead', 'hhsize',
           'income']]


In [78]:
# The crime-victim columns are columns 142-159 of DF (iloc positions 141-158).
# Each is coded 1 = yes, 2 = no, 9 = unspecified.
# Recoding "no" and "unspecified" to 0 makes the row sum count the "yes" answers,
# i.e. the number of crime types of which the person has been a victim.
# replace() returns a new frame, so the slice of DF is not modified in place.
victims = DF.iloc[:, 141:159].replace({2: 0, 9: 0})

# Total_IOPBV - total number of Instances Of a Person Being a Victim
victims['Total_IOPBV'] = victims.sum(axis=1)

In [79]:
# The crime-conviction columns are columns 160-177 of DF (iloc positions 159-176).
# Each is coded 1 = yes, 2 = no, 9 = unspecified.
# Recoding "no" and "unspecified" to 0 makes the row sum count the convictions.
# replace() returns a new frame, so the slice of DF is not modified in place.
convictions = DF.iloc[:, 159:177].replace({2: 0, 9: 0})

# The total number of convictions is the sum of all specific convictions
convictions['Total_number_of_convictions'] = convictions.sum(axis=1)
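
The recode-and-sum step used for both the victim and the conviction columns can be checked on a toy frame. The block below is a small sketch in which the column names (`victim_theft`, `victim_assault`, `victim_fraud`) and values are made up for illustration and are not the survey's actual column names.

```python
import pandas as pd

# Hypothetical indicator columns (names and values are made up)
toy_victims = pd.DataFrame({
    "victim_theft":   [1, 2, 9],
    "victim_assault": [1, 1, 2],
    "victim_fraud":   [2, 2, 2],
})

# 1 = yes stays 1; 2 = no and 9 = unspecified both become 0
recoded = toy_victims.replace({2: 0, 9: 0})

# The row-wise sum now counts the "yes" answers per row
recoded["Total_IOPBV"] = recoded.sum(axis=1)
print(recoded["Total_IOPBV"].tolist())  # [2, 1, 0]
```

Note that this treats "unspecified" as "no", so the totals are lower bounds wherever answers are missing.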

In [80]:
DATA = pd.concat([DATA,victims['Total_IOPBV']],axis=1)
DATA = pd.concat([DATA,convictions['Total_number_of_convictions']],axis = 1)

### Data cleaning 

The data encode missing or uninformative answers (effectively 'NaN' values) with the following codes:<br/>

3 - Don't Know<br/>
6 - Other, specify<br/>
9 - Unspecified<br/>
10 - Other<br/>
13 - Don’t know/Other<br/>
88 - Not applicable<br/>
99 - Unspecified<br/>


For Q51NOOTH (No other dwelling occupied), every 8 (Not applicable) is converted to 2, meaning No.
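
The cleaning rule described here can also be written as a per-column lookup of codes to treat as missing. The block below is a minimal sketch on a toy frame, not the notebook's own cleaning cell. The dictionary name `MISSING_CODES`, the toy values, and the choice to show only three columns are assumptions for illustration. Unlike the notebook, it discards the old row labels with `reset_index(drop=True)`.

```python
import pandas as pd

# Per-column codes to treat as missing, following the list above
# (only a few columns are shown; the mapping is an illustrative assumption)
MISSING_CODES = {
    "Q54ADWELLING": [3, 9],        # Don't know, Unspecified
    "Q532FOOD": [6, 9],            # Other, Unspecified
    "Q658VALUE": [13, 88, 99],     # Don't know, Not applicable, Unspecified
}

# Toy rows standing in for the survey data (values are made up)
toy = pd.DataFrame({
    "Q54ADWELLING": [1, 3, 2, 2, 2],
    "Q532FOOD":     [1, 2, 9, 5, 2],
    "Q658VALUE":    [4, 5, 6, 99, 7],
    "Q51NOOTH":     [1, 8, 8, 1, 8],
})

cleaned = toy.copy()
for column, codes in MISSING_CODES.items():
    # Series.mask replaces the matching entries with NaN
    cleaned[column] = cleaned[column].mask(cleaned[column].isin(codes))

# Drop every row with a missing answer, then recode Q51NOOTH 8 -> 2 (No)
cleaned = cleaned.dropna(subset=list(MISSING_CODES)).reset_index(drop=True)
cleaned["Q51NOOTH"] = cleaned["Q51NOOTH"].replace({8: 2})
print(cleaned)
```

Keeping the codes in one dictionary makes it easier to see which answers are discarded for each variable and to adjust the rule later.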

In [81]:
# Keep an untouched copy for comparison with the cleaned data
DATA_DIRTY = DATA.copy()
# Drop the rows whose answers use one of the codes that stand for 'NaN'

# drop 3 - Don't Know
DATA = DATA.drop(DATA[(DATA.Q54ADWELLING == 3) | (DATA.Q54AOTHER == 3) | (DATA.Q61035POLICE == 3)].index)

# drop 6 - Other, specify
DATA = DATA.drop(DATA[(DATA.Q532FOOD == 6)].index)

# drop 9 - Unspecified
DATA = DATA.drop(DATA[(DATA.Q116PRESENT == 9) | (DATA.Q51NOOTH  == 9) | (DATA.Q54ADWELLING == 9) | (DATA.Q54AOTHER == 9 ) | (DATA.Q524ELECT  == 9) | (DATA.Q532FOOD == 9) | (DATA.Q61035POLICE == 9) | (DATA.Q224BNOMONEY5 == 9) | (DATA.Q2341SAVING == 9)].index)

# drop 10 - Other
DATA = DATA.drop(DATA[(DATA.Q518TOILET  == 10)].index)

# drop 13 - Don’t know/Other
DATA = DATA.drop(DATA[(DATA.Q658VALUE == 13) | (DATA.Q56OTHER == 13) | (DATA.Q56DRINK == 13) | (DATA.Q52AWALLS == 13) | (DATA.Q52AROOF == 13)].index)

# drop 88 - Not applicable
DATA = DATA.drop(DATA[(DATA.Q658VALUE == 88)].index)

# drop 99 - Unspecified
DATA = DATA.drop(DATA[(DATA.Q658VALUE == 99) | (DATA.Q56OTHER == 99) | (DATA.Q56DRINK == 99) | (DATA.Q52AWALLS == 99) | (DATA.Q52AROOF == 99) | (DATA.Q651TOTROOMS == 99) | (DATA.Q531POLICE == 99) | (DATA.Q518TOILET == 99)].index) 

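# Recode Q51NOOTH: 8 (Not applicable) becomes 2 (No).
# Assign the result back instead of using inplace=True on a selected column.
DATA['Q51NOOTH'] = DATA['Q51NOOTH'].replace({8: 2})
# Renumber the rows; the old row labels are kept in a new 'index' column
DATA.reset_index(inplace=True)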
DATA

Unnamed: 0,index,Q116PRESENT,Q51MAIN,Q51NOOTH,Q52AWALLS,Q52AROOF,Q53WALLS,Q53ROOF,Q54ADWELLING,Q161CPUBATT,...,Q61035POLICE,Q224BNOMONEY5,Q2341SAVING,province_code,SETTLEMENT_TYPE,Ageofhead,hhsize,income,Total_IOPBV,Total_number_of_convictions
0,70,6,9,1,4,3,1,1,2,2,...,1,1,2,8,5,77,5,3.622811e+02,0,0
1,72,6,9,1,3,3,1,1,2,0,...,2,2,2,7,4,52,5,5.330340e+02,0,0
2,73,6,9,2,3,3,2,2,2,0,...,2,1,2,8,2,55,3,3.553560e+02,0,0
3,74,6,9,1,3,3,2,3,2,1,...,1,1,2,1,1,49,3,3.637277e+02,1,0
4,75,5,8,1,3,3,2,2,2,0,...,2,8,2,7,2,41,3,3.637277e+02,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13006,23356,3,1,1,1,9,5,5,2,0,...,2,8,2,7,1,62,1,8.210731e+05,0,0
13007,23359,3,1,1,1,9,4,4,2,0,...,1,8,2,1,1,59,2,1.739965e+06,1,0
13008,23366,4,1,2,1,3,4,4,2,0,...,1,8,2,8,1,79,1,9.313694e+05,0,0
13009,23372,2,1,1,1,9,5,5,2,0,...,1,8,2,1,1,46,2,2.044899e+06,0,0


## Exploratory Analyses