CMPE251: Data Analytics is a course taught at Queen's University by Dr. David Skillicorn. This is the capstone project for the course.
# CMPE251: Analysis of Adulterant in Drug Shipments

## Problem Definition 
### Project Background
As illicit drugs move from production to wholesale to street sale, dealers often add adulterants to the shipments to increase volume and thus their profits. These adulterants act as a fingerprint, or a trail of breadcrumbs, that can help law enforcement trace the drugs to their country of origin and the paths they take to the consumer.

This dataset describes the chemical composition of the adulterants added to the drug shipments found by customs. Each row of the dataset describes a shipment and each column denotes one adulterant detected.

### Project Goals 
1. Cluster the samples according to their adulterant pattern, and gain insights into the possible pathways by which the drugs arrived at countries' borders.
2. Create a model that can predict a shipment label from its adulterant profile.

Some insights that can be gained are:
1. How many clearly defined pathways exist
2. How many big players vs small players exist
3. How much overlap is there between pathways

### Assumptions/Considerations 
1. If multiple shipments have the same adulterant pattern, they were presumably sourced from the same place. 
2. If all the patterns are different, then pipelines are complex and overlapping. 
3. Though it is not explicitly mentioned in the project description, the drugs of concern here are illicit drugs.
4. The accuracy and breadth of the data depend on Customs' ability to seize and test drugs. Some pathways may be more covert than others, leading to over- or under-representation in the data.
5. Without knowing the country in which the drugs were seized, we cannot get a full picture of which pathways are more resilient to detection.


# First Look At Data
Let's take a first look at the data. In this step we'll first look at the shape of the data, a list of the adulterants in the set and their respective frequencies. 

In [22]:
# Dataset manipulation
import pandas as pd
import numpy as np

# Data visualization 
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns

df = pd.read_csv('./drugsamples.csv')
print(df.shape)
print(df.head())
print(df.columns)
print(df.dtypes)


(2474, 58)
  PROMIS_Seizure_SeizureName  Ecgonine  Benzoylecgonine  N_Formylcocaine  \
0                  PS3080497      0.05             0.11             0.02   
1                  PS3080497      0.02             0.08             0.02   
2                  PS3080497      0.03             0.08             0.01   
3                  PS3080497      0.04             0.10             0.01   
4                  PS3080497      0.02             0.07             0.01   

   Ecgonine_methyl_ester  Norcocaine  cis_Cinnamoylcocaine  \
0                   0.02        0.01                  0.07   
1                   0.03        0.01                  0.08   
2                   0.03        0.01                  0.08   
3                   0.03        0.01                  0.08   
4                   0.03        0.01                  0.08   

   trans_Cinnamoylcocaine  Percent_Truxilline  Trimethoxycocaine  ...  \
0                    0.18              2.5975               0.20  ...   
1            

We see that there are 2474 observations representing seized drug shipments and 58 columns: 57 representing the chemical adulterants tested for and one column named **PROMIS_Seizure_SeizureName**. At this point in the investigation, the meaning of the data in this column is unknown. It may represent an identifier of the location or batch of the drug shipments seized. According to the background information provided, each observation represents a unique shipment seized, but we can see from the output below that many different observations share the same string value in this column. A quick Google search of the column name returned no information. **An educated guess from the name of the feature is that it represents an identifier of the "seizure operation" in which shipments were seized, meaning that multiple shipments may be seized in one "incident".** I will follow up with the course instructors.

**After following up with the course TA: "PROMIS_Seizure_SeizureName" indicates the seizure operation, meaning there may be multiple containers in one seizure. Thus different observations with the same value in this column represent different samples of the same seizure.**

The unit of measurement for the values in the chemical adulterant columns is not given. From a quick glance, they appear to represent percentages of chemical composition. This will be explored in the Exploratory Data Analysis section.

We can also see that all the values in our dataset are numerical except for the first column, the string "PROMIS_Seizure_SeizureName".
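
The dtype observation above can also be checked programmatically rather than by eye. The following is a minimal sketch on a small made-up frame (`toy` and its values are illustrative assumptions, not the real dataset); applied to `df`, it should list only the seizure name column if the observation holds.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: one string ID column
# plus a couple of numeric compound columns.
toy = pd.DataFrame({
    "PROMIS_Seizure_SeizureName": ["PS0000001", "PS0000001", "PS0000002"],
    "Ecgonine": [0.05, 0.02, 0.03],
    "Norcocaine": [0.01, 0.01, 0.01],
})

# Keep only the columns whose dtype is not numeric.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```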

In [12]:
print(df['PROMIS_Seizure_SeizureName'].value_counts())

PS3187838    121
PS3098390     93
PS3239679     88
PS3087542     62
PS3247872     44
            ... 
PS3255752      1
PS3163708      1
PS3163700      1
PS3167858      1
PS3336931      1
Name: PROMIS_Seizure_SeizureName, Length: 465, dtype: int64
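
Since the counts above show a mix of large seizures and single-sample seizures, it may help to summarize how many seizures fall into each group. This is a minimal sketch on made-up seizure IDs (the `PS_A`, `PS_B`, `PS_C` values are placeholders); the same calls could be applied to `df['PROMIS_Seizure_SeizureName']`.

```python
import pandas as pd

# Hypothetical seizure identifiers, one entry per sample.
seizure_ids = pd.Series(
    ["PS_A", "PS_A", "PS_A", "PS_B", "PS_C", "PS_C"],
    name="PROMIS_Seizure_SeizureName",
)

# Number of samples recorded for each seizure operation.
samples_per_seizure = seizure_ids.value_counts()

n_seizures = samples_per_seizure.size                  # distinct seizures
n_multi = int((samples_per_seizure > 1).sum())         # seizures with several samples
n_single = int((samples_per_seizure == 1).sum())       # seizures with exactly one sample
print(n_seizures, n_multi, n_single)
```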


# Exploratory Data Analysis
Now that we've taken a first qualitative look at the data to understand the skeleton of the problem, we will take a quantitative dive into the dataset. **The goals of this section are to**:
1. Investigate the values held in each column
2. Investigate the presence of missing values in the dataset
3. Understand some summary statistics of the dataset (measures of central tendency)
4. Investigate the distribution of the different adulterants 
5. Investigate correlation between features
* Investigate correlation between PROMIS_Seizure_SeizureName and other features 

We will go through these objectives sequentially and summarize findings at the bottom.
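
As a rough preview of goals 2, 3 and 5, the sketch below runs the standard pandas calls for missing values, summary statistics and pairwise correlation. The frame `toy` and its column names are made up for illustration; on the real data the same calls would be applied to the compound columns.

```python
import numpy as np
import pandas as pd

# Hypothetical compound measurements with one missing value.
toy = pd.DataFrame({
    "compound_a": [0.10, 0.20, np.nan, 0.40],
    "compound_b": [1.0, 2.0, 3.0, 4.0],
    "compound_c": [4.0, 3.0, 2.0, 1.0],
})

# Goal 2: count missing values per column.
missing_per_column = toy.isna().sum()

# Goal 3: summary statistics (count, mean, std, quartiles, min, max).
summary = toy.describe()

# Goal 5: pairwise Pearson correlation between features.
correlations = toy.corr()

print(missing_per_column)
print(summary)
print(correlations)
```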

## Investigate the value of each column 
### Investigate chemical compound attributes
Here we will investigate the values held by our PROMIS_Seizure_SeizureName column and our compound columns separately. We begin with the compound columns.

In [30]:
# Get dataframe without the PROMIS_Seizure_SeizureName column, since the rest are the compound values we want to investigate
df_compounds = df.drop(columns='PROMIS_Seizure_SeizureName')
df_compounds['sum'] = df_compounds.sum(axis=1)
print('Max: ' + str(df_compounds['sum'].max()))
print('Min: ' + str(df_compounds['sum'].min()))
print('Mean: ' + str(df_compounds['sum'].mean()))

Max: 190.99259999999995
Min: 1.7289999999999999
Mean: 112.33602421180274


To test the initial assumption that each column holds a compound's percentage of the full shipment, all the values in each row were summed (excluding PROMIS_Seizure_SeizureName). If the values were percentages, we would expect each row's sum to be close to either 1 or 100. Instead, the sums range from ~1.73 to ~190.99, which discounts this theory.

Another possibility is that each column represents some measure of the mass, volume, or concentration of a chemical in the sample taken. Under that reading, the sum of the features for each row would equal the total mass/volume of a shipment.
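
The percentage check can be made more precise than comparing the minimum and maximum, by measuring what share of rows sum to roughly 1 or roughly 100. The sketch below uses made-up values and an arbitrary 5% relative tolerance; both the data and the `tolerance` value are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical compound columns: row sums are 100, 1 and 170.
toy = pd.DataFrame({
    "compound_a": [60.0, 0.5, 90.0],
    "compound_b": [40.0, 0.5, 80.0],
})

row_sums = toy.sum(axis=1)

# Assumed tolerance: a row counts as "percentage-like" if its sum is
# within 5% of 1 (fractions) or within 5% of 100 (percent).
tolerance = 0.05
near_1 = np.isclose(row_sums, 1.0, rtol=tolerance)
near_100 = np.isclose(row_sums, 100.0, rtol=tolerance)

# Share of rows consistent with the percentage hypothesis.
share_consistent = float((near_1 | near_100).mean())
print(share_consistent)
```

A share close to 1 would support the percentage reading; a low share would agree with the conclusion drawn from the range of sums.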

## Summary of Insights
1. The compound features do not represent a *percentage* of the shipment seized but may represent a portion of the mass/volume of the overall shipment.

### To Dos
* Reach out to data provider to inquire into the units represented in the data. 

# Clustering
Now we will work to cluster the data using various clustering techniques. This clustering analysis works on the assumption that **samples with similar chemical compositions share similar pathways**. Thus we can also assume the following about clusters: 
1. Different samples belonging to the same cluster may indicate that the samples had the same origin.
2. As we move outwards from the cluster centroid, points belonging to that cluster may have had the same or similar origins, but have taken different pathways to the place of seizure.
3. Two different clusters that overlap/intersect may denote two different pathways that converge at some point.
4. Big clusters may indicate a major drug pathway/big player.
5. Small clusters/anomalies may indicate smaller players.
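
To make the idea concrete, here is a minimal sketch of one possible clustering approach: standardizing the features and running k-means from scikit-learn. The choice of k-means, the value `n_clusters=2`, and the synthetic adulterant profiles are all assumptions for illustration; the notebook does not yet commit to a specific technique or number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for adulterant profiles: two hypothetical "pathways",
# each with 20 samples scattered tightly around its own composition.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.1, 2.0, 0.5], scale=0.02, size=(20, 3))
group_b = rng.normal(loc=[0.8, 0.5, 3.0], scale=0.02, size=(20, 3))
profiles = np.vstack([group_a, group_b])

# Standardize so that compounds measured on larger scales do not dominate
# the Euclidean distances used by k-means.
scaled = StandardScaler().fit_transform(profiles)

# Assumed number of clusters; on real data this would need to be chosen,
# for example by comparing several values of k.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(scaled)

# Cluster sizes hint at "big player" versus "small player" pathways.
cluster_sizes = np.bincount(labels)
print(cluster_sizes)
```

On the real data, samples from the same seizure would be expected to land in the same cluster if the assumption about shared pathways holds.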



## Summary of Insights

## TODO
* Look into the different features that denote the most similarity in a cluster


# Data Engineering
## Summary of Insights

# Model Building
## Summary of Insights

# Results
## Summary of Insights