## Data Description
***
#### Origin: The dataset was obtained from DataFirst, run by the University of Cape Town
#### Content: The dataset consists of features of cannabis consumption by South Africans from various socio-economic groups, provinces, settlement types and racial groups.
#### Date Collected: 2017
#### Collection Method: Primary data collection, via a survey of cannabis users.
#### Dataset Size: 2241 datapoints (rows) and 21 features (columns)
#### Date Downloaded: 05/03/2021 (5th of March 2021)
***
## Inspecting the data to assess quality and validity

In [2]:
#NOTE: REMOVE IMPORTS AFTER COMBINING WITH GROUP
import pandas as pd
import numpy as np

dataset = pd.read_csv("cdpd-2017-csv.csv")
print("Checking data validity")
# Print the distinct values found in each of the first 20 columns
for i in range(20):
    print("----------------------------------------------------------------------------")
    unique_values = dataset.iloc[:,i].unique()
    print(unique_values)


Checking data validity
----------------------------------------------------------------------------
['I consent']
----------------------------------------------------------------------------
['Male' 'Female' 'I would prefer not to answer' '.']
----------------------------------------------------------------------------
['Other' 'I would prefer to not answer' 'White' '.' 'African' 'Indian'
 'Coloured' 'Asian']
----------------------------------------------------------------------------
['< 20 years (younger than 20 years)' '30-39 years' '20-29 years'
 '50 years and above' '.' '40-49 years']
----------------------------------------------------------------------------
['R5000-R9999' 'R10 000- R14 999' '< R1000 (less than R1000)'
 'R20 000- R24 999' 'R1000- R4999' 'R15 000- R19 999' 'R30 000 and above'
 'R25 000- R29 999' '.']
----------------------------------------------------------------------------
['2-3 times per week' 'Everyday' '2-3 times per month' 'Once per week'
 'Once per month 

***
 -  #### Validations:
    -  The data's format is consistent, as can be seen above.
    -  In general, respondents' unanswered questions are recorded as "."; in numeric columns, NaN is used instead.
    -  The variable types remain consistent within each column, and the shape of the data points (the number of features) remains consistent.
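A minimal sketch of how both missing-value markers could be counted together. The toy frame below is hypothetical and only borrows the column names `q2` and `q16` from the outputs in this notebook; it is not the survey data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking the two missing-value markers ("." and NaN)
toy = pd.DataFrame({
    "q2": ["Male", "Female", ".", "Male"],
    "q16": [10.0, np.nan, 8.0, 7.0],
})

# Treat "." as missing, then count missing entries per column
cleaned = toy.replace(".", np.nan)
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

Applied to the real `dataset`, the same two calls would give a per-column missing count that covers both markers at once.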
***
#### Inspecting the data quality


In [4]:
print("####################################################################")   
print("Checking data quality")
print("####################################################################")   
for i in range(0,13):
    col = dataset.iloc[:,i].value_counts()
    print(col)
    print("----------------------------------------------------------------------------")

####################################################################
Checking data quality
####################################################################
I consent    2241
Name: q1, dtype: int64
----------------------------------------------------------------------------
Male                            1606
Female                           615
I would prefer not to answer      18
.                                  2
Name: q2, dtype: int64
----------------------------------------------------------------------------
White                           1544
Coloured                         251
African                          209
I would prefer to not answer     102
Indian                            88
Other                             28
.                                 11
Asian                              8
Name: q3, dtype: int64
----------------------------------------------------------------------------
20-29 years                           1439
30-39 years                        

***
 - #### Data Quality Analysis (first 14 questions):
    - All respondents in the dataset consented. This is what we would expect, since consent would be required for their information to be kept.
    - The data contains very few NaNs (in numeric columns) and very few unanswered questions (represented by '.'). This indicates decent quality in the first 14 questions.
    - There is a mix of different ethnicities, genders and locations, as we would expect for a survey meant to represent South African users in a general sense.
    - The size of each purchase (in grams) and the price of the entire purchase (ZAR) indicate that most respondents are consumers, since the prices and quantities are small. A few dealers do, however, appear to be in the dataset. This is plausible: the survey asked about cannabis use without restricting how respondents use it, and a small number of dealers relative to consumers is what a consumer-supplier model would predict.
    - The two most-represented provinces make sense given their population sizes.
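One rough way to check the consumer-versus-dealer reading is to look for purchases far above the typical size. The sketch below uses made-up gram values, a hypothetical column name and an arbitrary cutoff of ten times the median; all three are assumptions for illustration, not figures or a method taken from the survey.

```python
import pandas as pd

# Hypothetical purchase sizes in grams (not taken from the survey)
grams = pd.Series([2, 5, 5, 10, 10, 20, 28, 30, 500, 1000], name="grams_purchased")

# Assumed rule of thumb: flag anything far above the typical purchase
median = grams.median()
cutoff = 10 * median  # arbitrary multiplier, chosen for illustration only
unusually_large = grams[grams > cutoff]

print(f"median purchase: {median} g, cutoff: {cutoff} g")
print(f"{len(unusually_large)} of {len(grams)} purchases exceed the cutoff")
```

A median-based cutoff is used here because a handful of very large purchases would distort a mean-based one.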
    
***
 -  #### Note to the reader on the q14_ column suffixes used below:
      - u indicates urban
      - r indicates residential
      - i indicates industrial
      - rur indicates rural
      - m indicates mixed
      - pnd indicates not disclosed
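The suffixes above could be kept in a small lookup so that later output uses readable labels. This is only a sketch: the three-row sample is hypothetical, and it assumes a selected option is stored as the string "true" and an unselected one as ".".

```python
import pandas as pd

# Suffix meanings as listed in the note above
SETTLEMENT_LABELS = {
    "q14_u": "urban",
    "q14_r": "residential",
    "q14_i": "industrial",
    "q14_rur": "rural",
    "q14_m": "mixed",
    "q14_pnd": "not disclosed",
}

# Hypothetical three-row sample; "true" marks a selected option, "." an unselected one
toy = pd.DataFrame(
    [
        ["true", ".", ".", ".", ".", "."],
        [".", "true", ".", ".", ".", "."],
        [".", "true", ".", ".", ".", "."],
    ],
    columns=list(SETTLEMENT_LABELS),
)

# Convert to booleans and give each column its readable label
selected = toy.eq("true").rename(columns=SETTLEMENT_LABELS)
counts = selected.sum()
print(counts)
```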
      
***

In [31]:
tot = 0
pnd = 0
for i in range(13,19):
    col = dataset.iloc[:,i].value_counts()
    print(col)
    # Look up the "true" count by label: value_counts() sorts by frequency,
    # so position 1 is the "." count whenever "true" is the majority (as in q14_r).
    if i != 18:
        tot = tot + col["true"]
    else:
        pnd = col["true"]
    print("----------------------------------------------------------------------------")

print("The total of the answered settlement types is " + str(tot))
tot = tot+pnd
print("Which with the people who chose not to answer gives a total of " + str(tot))

.       1658
true     583
Name: q14_u, dtype: int64
----------------------------------------------------------------------------
true    1155
.       1086
Name: q14_r, dtype: int64
----------------------------------------------------------------------------
.       2221
true      20
Name: q14_i, dtype: int64
----------------------------------------------------------------------------
.       2185
true      56
Name: q14_rur, dtype: int64
----------------------------------------------------------------------------
.       2000
true     241
Name: q14_m, dtype: int64
----------------------------------------------------------------------------
.       2074
true     167
Name: q14_pnd, dtype: int64
----------------------------------------------------------------------------
The total of the answered settlement types is 2055
Which with the people who chose not to answer gives a total of 2222


***
 - #### Further Data Quality Analysis on Settlements:
     - Most respondents live where we would expect: residential and urban areas.
     - There is a small mismatch: the selections total 2222 against 2241 respondents, so at least 19 respondents did not answer this question at all. This is a small proportion, is acceptable, and can be dealt with during data wrangling.
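Respondents who left the settlement question blank can also be counted row by row instead of inferred from column totals, which stays correct even if someone ticked more than one settlement type. A minimal sketch on a hypothetical four-row sample with only three of the q14_ columns, assuming "true" marks a selected option:

```python
import pandas as pd

# Hypothetical sample: row 2 selects nothing, row 3 selects two options
toy = pd.DataFrame({
    "q14_u": ["true", ".", ".", "true"],
    "q14_r": [".", "true", ".", "true"],
    "q14_pnd": [".", ".", ".", "."],
})

# Number of options each respondent selected
per_row = toy.eq("true").sum(axis=1)

no_answer = int((per_row == 0).sum())  # respondents with no option selected
multiple = int((per_row > 1).sum())    # respondents with several options selected
print(f"no answer: {no_answer}, multiple selections: {multiple}")
```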
***

In [33]:
for i in range(19,27):
    col = dataset.iloc[:,i].value_counts()
    print(col)
    print("----------------------------------------------------------------------------")

Series([], Name: q15, dtype: int64)
----------------------------------------------------------------------------
10.0    633
8.0     433
7.0     269
5.0     208
9.0     207
6.0     152
4.0      96
2.0      90
3.0      89
1.0      59
Name: q16, dtype: int64
----------------------------------------------------------------------------
More than 30 days    312
30                   307
7                    282
5                    155
2                    148
14                   148
3                    147
10                   120
4                     89
1                     77
20                    66
25                    59
15                    57
21                    53
6                     52
8                     26
.                     24
12                    24
9                     14
28                    12
16                    10
24                     9
17                     9
22                     9
27                     6
13                     5
18              

***
- #### Data Quality Analysis Conclusion:
    - The dataset appears to be of decent quality overall, with very few missing entries and a broad cross-section of cannabis users.
    - It follows a consistent format, and the recorded answers are acceptable.
    
***
 - #### Ability to Answer Question:
     - The dataset contains a varied group of users and has decent data quality, so general inferences can be made from it.
     - The dataset also gives a general picture of consumption characteristics: the frequency, quantity, quality and price of cannabis consumption by the average user.
     - Together with the income ranges given in the dataset, these characteristics can be used to answer and substantiate the focus question, as well as other interesting questions that may arise as further analysis is undertaken.