
#**Feature Association for Variables Impacting NDIS Eligibility**
This dataset contains information from 5,643 sexual assault kits (SAKs) collected within the state of Utah. Each row describes a single kit, and each column is a feature recording a patient's response to a question about their specific case. The aim of this script is to sort through these features and determine which are significantly associated with whether a SAK produces a National DNA Index System (NDIS) eligible profile. Significance is assessed with Chi-Square tests, ANOVA, and t-tests, depending on whether the data is numerical or categorical.

# Part 1. Set-Up
First, we must download the dataset. When you run the cell, you will be prompted to log in to your Gmail account. You will then be given a one-time code to copy and paste into the field below. After you press Enter, the dataset will load into this script.

*See [IO Notebook](https://colab.research.google.com/drive/1fuF8iahEqBFV62Y6OoiEViUqHo-DbXrT) for more information about set-up and interacting with our dataset

In [None]:
#pulls up our SAK dataset
#@title Upload Dataset
file_id = "13DLmmbYXonl9alHR4VobfeTuA_IxlQxZ" #@param {type:"string"}
!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

from google.colab import auth
auth.authenticate_user()

from googleapiclient.discovery import build
drive_service = build('drive', 'v3')

import io
from googleapiclient.http import MediaIoBaseDownload

request = drive_service.files().get_media(fileId=file_id)
downloaded = io.BytesIO()
downloader = MediaIoBaseDownload(downloaded, request)
done = False
while done is False:
  _, done = downloader.next_chunk()

fileId = drive.CreateFile({'id': file_id }) #DRIVE_FILE_ID is file id example: 1iytA1n2z4go3uVCwE_vIKouTKyIDjEq
print(fileId['title'])  
fileId.GetContentFile(fileId['title'])  # Save Drive file as a local file

!pip install -U scikit-learn

from scipy.stats import chi2_contingency
from scipy import stats
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# To change scientific numbers to float
np.set_printoptions(formatter={'float_kind':'{:f}'.format})

with open(fileId['title'], encoding="utf8", errors='ignore') as f:
  df = pd.read_csv(f)

df = df.apply(pd.to_numeric, errors='ignore')
predictedVariable = "CODISNDISeligibleProfile"


#rows without a value for CODISNDISeligibleProfile are dropped further down, once blank cells have been marked "No Response"

predictedFeatures = ['CODISNDISeligibleProfile', 'SDISeligibleprofile']  

numericalFeatures = ['Age', 'Timebetweenassaultandexaminhours', 'PainLevel', 'MulitipleSuspectNumber', 
                     'NumberofUnknownresponses', 'NumberAssaultiveActs', 'Numberofphysicalinjuries', 'Numberofgentialinjuries',
                     'NumberOFitemsTested', 'TimeBetweenCollectAndDNAext', 'TimeBetweenSubmissionANDtesting', 'NumberOfswabsQuantMaleDNA',
                     'NumberOfswabsDNAanalysis', 'NumberofSTRDNAloci', 'NumberOFswabsSTRDNAprofile', 'NumberOfYSTRDNAloci']

categoricalFeatures = ['Site', 'EXAMbySANE', 'YearKitCollected', 'KITbroughtTOcrimelab', 'KITlengthofSubmissionTime',
                       'UnderAge18', 'Gender', 'ExamDeclined', 'Noninterview', 'Race', 'PriorHxofSAover14',
                       'PriorHxofSAunder14', 'Student', 'Military', 'Pain', 'PainLocation1','PainLocation2', 
                       'PainLocation3', 'PainLocation4','PainTreatment', 'PermanentAddress', 'CurrentPhysicalmedprob',
                       'MedProbChronic', 'MedProbInfection', 'MedProbBlood', 'MedProbCardiac', 'MedProbEar', 'MedProbEndocrine',
                       'MedProbEye', 'MedProbGI', 'MedProbGU', 'MedProbGYN', 'MedProbImmune', 'MedProbMusculoskeletal', 'MedProbNeurological',
                       'MedProbOral', 'MedProbRenal', 'MedProbRespiratory', 'MedProbSkin', 'MedProbOther', 'Medication',
                       'PsychotropicMEDuse', 'PsychotropicANTIPSYCHOTICSatypical', 'PsychotropicSTIMULANTuse', 'PsychotropicANTIANXIETY', 
                       'PsychotropicANTIDEPRESSANTS', 'PsychotropicANTISEIZUREbipolar', 'PsychotropicADDICTIONmeds','PsychotropicSLEEPaid', 'PsychotropicOTHER', 
                       'PsychotropicANTIPSYCHOTICStypical', 'PolypharmacyPsychMeds', 'ImmunizationstatusTETANUS', 'ReceivedTetanus',
                       'ImmunizationstatusHEP', 'ReceivedHepB', 'Sexualcontactwithin120hours', 'Selfdisclosurementalillness', 'MIdepression',
                       'MIanxiety', 'MIPTSD', 'MIpsychoticDisorders', 'MIadhd', 'MIpersonalitydisorder', 'MIbipolar', 'MIeatingdisorder', 'MIdrugalcoholdisorders', 
                       'MIother', 'SelfDiscolsureMentalillnessORuseofpsychotropics', 'OnlineMeetingOFsuspect', 'Suspectrelationship',
                       'Locationofassault', 'PatientActionScratch', 'PatientActionBite', 'PatientActionHit', 'PatientActionKick', 'PatientActionOther',
                       'Suspectrace', 'SuspectactionVERBAL', 'SuspectactionsGRABBEDHELD', 'SuspectactionsPHYSICALBLOWS', 'SuspectactionsSTRANGLEDCHOKED',
                       'SuspectactionsWEAPON', 'SuspectactionsRESTRAINTS', 'SuspectactionsBURNED', 'MultipleSuspects', 'SuspectedDrugfacilitated',
                       'Patientdruguse', 'PatientETOHuse', 'Suspectdruguse', 'SuspectETOHuse', 'PatientSuspectETOHordrug', 'LossOFconsciousnessORawareness',
                       'OneORmoreunknownanswer', 'Unknownanswerto4ormorequestions', 'UnknownanswertoALL', 'AsleepANDawakenedtoassault', 'MemoryLoss',
                       'LossOfconsciousness', 'DecreasedAwareness', 'TonicImmobility', 'Detachment', 'NOSApatientsVAGINApenis', 'NOSApatientsVAGINAfingerhand',
                       'NOSApatientsVAGINAmouth', 'NOSApatientsVAGINAobject', 'NOSApatientsANUSpenis', 'NOSApatientsANUSfingerhand', 'NOSApatientsANUSmouth', 
                       'NOSApatientsANUSobject', 'NOSApatientsPENISgenitals', 'NOSApatientsPENISfinger', 'NOSApatientsPENISmouth', 'NOSApatientsPENISobject', 
                       'NOSApatientsMOUTHpenis', 'NOSApatientsMOUTHfinger', 'NOSApatientsMOUTHmouth', 'NOSApatientsMOUTHobject', 'SUSPECTmouthcontactGENITALS', 
                       'SUSPECTmouthcontactMOUTH', 'SUSPECTmouthcontactOTHER', 'SUSPECTmouthcontactOTHERsite', 'HANDSofSuspectBreast', 'HANDSofSuspectExtremities', 
                       'HANDSofSuspectOther', 'Ejaculation', 'CONDOMuse', 'LUBRICATIONuse', 'SuspectWASHEDpatient', 'SuspectINJUREDbypatient', 'PostassaultURINATED', 
                       'PostassaultDEFECATED', 'PostassaultDOUCHED', 'PostassaultVOMITED', 'PostassaultGARGLED', 'PostassaultBRUSHEDTEETH', 'PostassaultATEdrank', 
                       'PostassaultBATHED', 'PostassaultGENITALWIPE', 'PostassaultCHANGEDCLOTHING', 'PostassaultREMOVEDInserted', 'PhysicalORmentalimpairment', 'Physicalinjury', 
                       'LPIhead', 'LPIneck', 'LPIbreasts', 'LPIchestback', 'LPIabdomen', 'LPIextremities', 'TPIlaceration', 'TPIecchymosis', 'TPIabrasion', 'TPIredness', 
                       'TPIswelling', 'TPIbruise', 'TPIpetechiae', 'TPIincision', 'TPIavulsion', 'TPIdiscoloredmark', 'TPIpuncturewound', 'TPIfracture', 
                       'TPIbitemark', 'TPIburn', 'TPImissingorbrokenTEETH', 'TPIconjunctivalhemorrhage', 'Genitalinjury', 'LGIinnerthighs', 'LGIclitoralhoodclitoris', 
                       'LGIlabiamajora', 'LGIlabiaminora', 'LGIperiurethraltissueURETHRA', 'LGIperihymenaltissue', 'LGIhymen', 'LGIvagina', 'LGIcervix', 'LGIfossanavicularis', 
                       'LGIposteriorfourchette', 'LGIperineum', 'LGIanalrectal', 'LGIbuttocks', 'LGImalePerianalperineum', 'LGIglanspenis', 'LGIpenileshaft', 
                       'LGImaleURETHRALmeatus', 'LGIscrotum', 'LGItestes', 'LGImaleanus', 'LGImalerectum', 'TGIlaceration', 'TGIecchymosis', 'TGIabrasion', 'TGIredness', 
                       'TGIswelling', 'TGIbruise', 'TGIpetechiae', 'TGIincision', 'TGIavulsion', 'TGIdiscoloredmark', 'TGIpuncturewound', 'ToludineDYEuptake', 'HIVnPEP', 
                       'UQuikcollected', 'Yscreen', 'NumberItemsWITH3cutoff', 'ItemsAnalyzed1', 'ItemsAnalyzed2', 'ItemsAnalyzed3', 'ItemsAnalyzed4', 'ItemsAnalyzed5', 
                       'ItemsAnalyzed6', 'ItemsAnalyzed7', 'ItemsAnalyzed8', 'ItemsAnalyzed9', 'ItemsAnalyzed10', 'TypesOFitemsTested', 'RandomSample20142015', 
                       'YearofDNAextraction', 'LocationOfTesting','DANYfundedSAK', 'DNAKitUsed', 'SerologyDoneBeforeDNA', 'QuantMaleDNAFound', 'QuantMaleSwabLoc1', 
                       'QuantMaleSwabLoc2', 'QuantMaleSwabLoc3', 'QuantMaleSwabLoc4', 'QuantMaleSwabLoc5', 'ProbableSTRDNAprofileOFsuspect', 'ProfileofSTRDNAloci', 'ProbableYSTRDNAprofile', 'ProfileOfYSTRDNAloci', 
                       'SwabLocationYSTRDNA', 'SecondSwabLocationYSTRDNA', 'SwabFromSuspectwithVictimDNA', 'ExcludeSuspect', 'ConsensualPartnerStandardSubmitted', 
                       'STRDNAProbableprofileTYPE', 'CODISprofileHit', 'STRDNAkitUsed', 'SUSPECTmouthcontactBREASTS', 'Swab1LocationSTRDNAprofile', 'Swab2LocationSTRDNAprofile',
                       'Swab3LocationSTRDNAprofile', 'SuspectStandardSubmitted', 'CODISNDISreasons', 'CODISSDISreasons']

#unusedFeatures and stringFeatures are columns that contain data that was relevant to medical professionals and for legal purposes, 
#but that aren't useful for our feature association or for predicting eligibility
unusedFeatures = ['filter_$', 'PainTreatmentYesNo', 'GenderMaleFemale', 'DVsuspect', 'RacePrimaryGroups', 'IPSAcombined', 'STRDNAcompleted', 
                  'PhysicalInjuryNOunknown', 'GenitalInjuryNOunknown']

stringFeatures = ['DeIdentifiedCase', 'Raceother', 'SchoolName', 'MilitaryBranchName', 'AddressIfnotPermanent', 'Currentmedprobtext',
                  'MedProbOtherText', 'Medicationtext', 'Sexualcontactwithin120hoursTYPE', 'SelfdisclosureMItype', 'OnlineMeetingName', 'SuspectrelationshipOTHER',
                  'LocationofassaultOTHER', 'Surfaceofassault', 'PatientActionOtherTEXT', 'SuspectraceOTHER', 'SuspectOTHERactions', 'NOSApatientsVAGINAobjectdescription',
                  'NOSApatientsANUSobjectdescription', 'NOSApatientsPENISobjectdescription', 'NOSApatientsMOUTHobjectdescription', 'EjaculationSITE', 'LUBRICATIONtype',
                  'SuspectINJUREDbypatientexplanation', 'Impairmentdescription', 'UBFSnumber', 'ISPnumber', 'DateSubmittedUBFS', 'DateofDNAextractionReport',
                  'BodySwabLocQuant', 'BodySwabDNAanalysis', 'BodySwabLocationSTRDNA', 'BodySwabYSTRDNA', 'ISPnotes2020', 'UBFSnotes2020', 'UBFSnotes2018', 'UBFSnotes2014']

dummy1 = pd.get_dummies(df['Swab1ToDNAanalysis'])
dummy2 = pd.get_dummies(df['Swab2ToDNAanalysis'])
dummy3 = pd.get_dummies(df['Swab3ToDNAanalysis'])
dummy4 = pd.get_dummies(df['Swab4ToDNAanalysis'])

dummy4['0'] = dummy1['0']
dummy1['11'] = dummy2['11']
dummy3['0'] = dummy1['0']

newDummy = dummy1


newDummy = newDummy.where(newDummy != 0, dummy2)
newDummy = newDummy.where(newDummy != 0, dummy3)
newDummy = newDummy.where(newDummy != 0, dummy4)

# print(newDummy)
df['SwabToDNAanalysisNoquantmaleDNAfound'] = newDummy['0'].astype('category')
df['SwabToDNAanalysisVaginal'] = newDummy['1']
df['SwabToDNAanalysisCervical'] = newDummy['2']
df['SwabToDNAanalysisPerianal'] = newDummy['3']
df['SwabToDNAanalysisRectal'] = newDummy['4']
df['SwabToDNAanalysisOral'] = newDummy['5']
df['SwabToDNAanalysisBody'] = newDummy['6']
df['SwabToDNAanalysisUnderwear'] = newDummy['7']
df['SwabToDNAanalysisOtherClothing'] = newDummy['8']
df['SwabToDNAanalysisBedding'] = newDummy['9']
df['SwabToDNAanalysisCondom'] = newDummy['10']
df['SwabToDNAanalysisTampon'] = newDummy['11']

swabToDNAFeatures = ['SwabToDNAanalysisNoquantmaleDNAfound', 'SwabToDNAanalysisVaginal', 'SwabToDNAanalysisCervical', 'SwabToDNAanalysisPerianal', 'SwabToDNAanalysisRectal', 
                     'SwabToDNAanalysisOral','SwabToDNAanalysisBody', 'SwabToDNAanalysisUnderwear', 'SwabToDNAanalysisOtherClothing', 'SwabToDNAanalysisBedding', 
                     'SwabToDNAanalysisCondom','SwabToDNAanalysisTampon' ]

categoricalFeatures.extend(swabToDNAFeatures)

df = df.replace(r'^\s+$', np.nan, regex=True)
df = df.replace({np.nan: "No Response"})
df = df.applymap(str)
df = df[df[predictedVariable] != "No Response"]

#Code to filter out all other genders
# df = df[df['Gender'] == '1'] #dataframe containing information from only female respondents

df = df[df['Site'] != '6'] #remove idaho data

MasterValentine_UpdatedCODIS_Feb12_2021.csv


# Part 2. Find Association With CODIS Eligibility
Now that we have loaded and cleaned our dataset, we can use a couple of different tools to measure how strongly each feature is associated with obtaining an NDIS eligible profile. Because our dataset has both numerical and categorical features, we need a different approach for each type. This section is therefore split into the Chi-Square Test of Independence for categorical features and ANOVA for numerical features.
1. Run Chi-Square on Categorical Features
2. Run ANOVA on Numerical Features
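
Both tests in this section compare their p-values to a Bonferroni-style cutoff. The sketch below shows how such a cutoff splits p-values into significant and nonsignificant groups. The feature names and p-values are hypothetical, and `n_tests` is an assumed count of comparisons (the notebook itself divides by `len(df.columns)`).

```python
# Hypothetical p-values keyed by made-up feature names (not from the SAK dataset)
p_values = {"featureA": 1e-6, "featureB": 0.004, "featureC": 0.2}

alpha = 0.05
n_tests = len(p_values)      # assumption: one test per feature
cutoff = alpha / n_tests     # Bonferroni-corrected threshold

# Split the features by comparing each p-value to the corrected threshold
significant = {k: p for k, p in p_values.items() if p <= cutoff}
nonsignificant = {k: p for k, p in p_values.items() if p > cutoff}

print(f"cutoff = {cutoff:.4f}")
print("significant:", sorted(significant))
print("nonsignificant:", sorted(nonsignificant))
```

Dividing by a larger number of tests makes the cutoff stricter, so fewer features are labeled significant.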

## Run Chi-Square on Categorical Features
We will run Chi-Square Tests of Independence on all categorical features versus the feature we are trying to predict (in this case, CODISNDISeligibleProfile). Each test returns a p-value that we compare to our Bonferroni-corrected cutoff (0.05 divided by the number of columns in the dataframe). To run a test, we create a contingency table of the feature against the predicted feature and pass that table to the chi2_contingency function from the scipy package we imported earlier.

*For an in-depth walkthrough of Chi-Square, see our [Chi-Square Example](https://colab.research.google.com/drive/1KpIErpzgsCY7d2sa9j8J9QNIo9HvaWUW#scrollTo=fOSTbNFFiBrW)
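
As a small illustration of the contingency-table step, here is a minimal sketch on toy data. The `toy` dataframe and its `feature` and `outcome` columns are hypothetical stand-ins, not columns from the SAK dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: a two-level feature and a binary outcome
toy = pd.DataFrame({
    "feature": ["a"] * 50 + ["b"] * 50,
    "outcome": ["0"] * 40 + ["1"] * 10 + ["0"] * 15 + ["1"] * 35,
})

# Rows are feature levels, columns are outcome levels
contingency = pd.crosstab(toy["feature"], toy["outcome"])

# chi2_contingency returns the test statistic, the p-value, the degrees of
# freedom, and the table of counts expected under independence
chi2, p, dof, expected = chi2_contingency(contingency)

print(contingency)
print(f"chi2 = {chi2:.2f}, p = {p:.3g}, dof = {dof}")
```

The further the observed counts sit from the `expected` counts, the larger the statistic and the smaller the p-value.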

###Run Chi-Square Test Using Gender
If we break down our dataset by gender, it is evident that the majority of our data comes from female victims. Let's delve deeper into this feature by running chi-square with gender against CODIS eligibility. 

In [None]:
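#Code to run Gender against CODIS eligibility
newContingency = pd.crosstab(df['Gender'], df[predictedVariable])
c, p, dof, expected = chi2_contingency(newContingency)
verdict = "significant" if p <= 0.05 else "not significant" #conventional 0.05 level, not the Bonferroni cutoff
print("The p-value obtained was " + str(p) + " which is " + verdict)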

The p-value obtained was 0.0004294736932671727 which is significant


Based on the p-value from the chi-square test comparing gender to CODIS eligibility, gender is significantly associated with whether an eligible profile is produced at the conventional 0.05 level (it would not pass the stricter Bonferroni cutoff applied later in this section). Now let's look closer at our contingency table for gender.

In [None]:
#Contingency table of Gender and Eligibility
print(newContingency)

CODISNDISeligibleProfile     0     1
Gender                              
1                         2901  1503
2                          156    37
3                            8     5
4                            7     1
5                            4     0
No Response                  3     1


Our contingency table clearly shows that most of our data comes from female respondents. To keep our analysis focused, we will analyze only female respondents. The next cell filters out all other recorded gender types (male, transgender, intersex, and no response).

In [None]:
#Code to filter out all other genders
df = df[df['Gender'] == '1'] #dataframe containing information from only female respondents
df.to_csv("femaledf.csv", index=False)

###Run Chi-Square Test on all Categorical Features
We will loop through each feature and sort it into a separate table depending on its significance for obtaining a CODISNDIS eligible profile. Both tables (significant and nonsignificant) are output in the results section.

In [None]:
def runChi(contingency, column):
  """
  Performs Chi-Square Test of Independence on a contingency table and stores the obtained p-value in a dictionary based on significance
    @param contingency: table comparing values from one feature to another
    @param column: string name of the current column being compared to CODISNDISeligibleProfile
    @return: void
  """
  c, p, dof, expected = chi2_contingency(contingency)
  cutoff = 0.05 / len(df.columns) #Bonferroni-style cutoff
  if p <= cutoff:
    sigValues[column] = p #global variable declared in next block of code
  else:
    inSigValues[column] = p #global variable declared in next block of code
  allValues[column] = p #global variable declared in next block of code

In [None]:
sigValues = {} #All features and their associated p-values that have a p-value less than bonferroni correction value
inSigValues = {} #All features and their associated p-values that have a p-value greater than bonferroni correction value
allValues = {} #All features and their associated p-values, regardless of significance level

#This will loop through each categorical feature, create a contingency table
#and perform a chi-square test of independence to find the feature's association with CODISNDIS eligibility
#P-Values obtained will be used to sort the feature into sigValues or inSigValues
for column in categoricalFeatures:
  if column != predictedVariable:
    try:
      newContingency = pd.crosstab(df[column], df[predictedVariable])
      runChi(newContingency, column)
    except Exception:
      print(column) #features for which the test could not be run

##Run ANOVA on Numerical Features
We will run ANOVA tests on all numerical features versus the feature we are trying to predict (in this case, CODISNDISeligibleProfile). Each test returns a p-value that we compare to our Bonferroni-corrected cutoff.

*For more on ANOVA, see our [Anova Example](https://colab.research.google.com/drive/1zpCaYq4-R3pnVpa4kimvU1Dcn0An-DoP) notebook
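
To make the test concrete, here is a minimal sketch of one-way ANOVA on two groups, using hypothetical measurements rather than SAK data. Because the predicted feature has only two levels, the ANOVA is equivalent to an equal-variance two-sample t-test, which the sketch checks.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for two outcome groups (not from the SAK dataset)
group_0 = np.array([10.0, 12.0, 9.0, 11.0, 13.0])
group_1 = np.array([15.0, 14.0, 16.0, 17.0, 13.0])

# One-way ANOVA compares the group means
f_value, p_anova = stats.f_oneway(group_0, group_1)

# With only two groups, this matches an equal-variance two-sample t-test:
# the F statistic equals t squared and the p-values agree
t_value, p_ttest = stats.ttest_ind(group_0, group_1, equal_var=True)

print(f"F = {f_value:.3f}, p = {p_anova:.4g}")
print(f"t^2 = {t_value**2:.3f}, p = {p_ttest:.4g}")
```

A small p-value indicates that the mean of the numerical feature differs between the two outcome groups.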

In [None]:
import scipy.stats as stats
#create dicts for storing p-values
anovaValue = {} #All features with their corresponding p-value
sigAnova = {} #Features with significant p-values
inSigAnova = {} #Features with insignificant p-values

Bcutoff = 0.05 / len(df.columns)
# stats.f_oneway takes the groups as input and returns the ANOVA F statistic and p-value

#This will loop through each numerical feature and perform ANOVA testing to find the feature's association with CODISNDIS eligibility
#P-Values obtained will be used to sort the feature into sigAnova or inSigAnova
for column in numericalFeatures:
  if column != predictedVariable:
    temp = df[[column, predictedVariable]]
    temp = temp[temp[column] != "No Response"]
    #values were converted to strings during set-up, so convert them back to numbers for the test
    values = pd.to_numeric(temp[column])
    ineligible = values[temp[predictedVariable] == "0"]
    eligible = values[temp[predictedVariable] == "1"] #assumes 1 codes an eligible profile

    fvalue, pvalue = stats.f_oneway(eligible, ineligible) #compares the means between eligible and ineligible kits within the given feature
    anovaValue[column] = pvalue

    if pvalue <= Bcutoff:
      sigAnova[column] = pvalue
    else:
      inSigAnova[column] = pvalue

# Part 3. Print Results 
Here we can see how our features stacked up. 


*   Results of Categorical Features
*   Results of Numerical Features




##Results of Categorical Features


1.   Table of ALL Features Sorted by Increasing P-Value
2.   Table of Nonsignificant Features
3.   Table of Significant Features



###1. Table of ALL Features Sorted by Increasing P-Value.

In [None]:

#@title All Categorical Features Sorted by Increasing P-Value
#Prints P-Values for all features (both significant and insignificant)
pd.set_option("display.max_rows", None, "display.max_columns", None)
table = pd.DataFrame(list(allValues.items())) 

table = table.rename(columns={0: "Feature", 1: "P-Value"})
table = table.sort_values(by=['P-Value'])

print(table)

                                             Feature        P-Value
228                              ProfileofSTRDNAloci   0.000000e+00
227                   ProbableSTRDNAprofileOFsuspect   0.000000e+00
236                        STRDNAProbableprofileTYPE   0.000000e+00
237                                  CODISprofileHit   0.000000e+00
240                       Swab1LocationSTRDNAprofile   0.000000e+00
244                                 CODISNDISreasons   0.000000e+00
245                                 CODISSDISreasons   0.000000e+00
238                                    STRDNAkitUsed  2.758086e-288
241                       Swab2LocationSTRDNAprofile  2.958663e-272
219                                       DNAKitUsed  1.152716e-258
223                                QuantMaleSwabLoc2  4.306772e-216
222                                QuantMaleSwabLoc1  4.844051e-185
247                         SwabToDNAanalysisVaginal  8.382494e-171
221                                QuantMaleDNAF

**The table above shows ALL features, regardless of whether they passed our significance threshold or not**

###2. Table of Nonsignificant Features

In [None]:
#@title Nonsignificant Features
#Display P-Values of all nonsignificant values
pd.set_option("display.max_rows", None, "display.max_columns", None)
inSigTable = pd.DataFrame(list(inSigValues.items())) 

inSigTable = inSigTable.rename(columns={0: "Feature", 1: "P-Value"})
inSigTable = inSigTable.sort_values(by=['P-Value'])

print(inSigTable)
print("Number of Nonsignificant Features: " + str(len(inSigTable)))
inSigTable.to_excel("insig.xlsx")


                                             Feature   P-Value
67                                    Suspectdruguse  0.000241
105                     LGIperiurethraltissueURETHRA  0.000329
146                         SwabToDNAanalysisBedding  0.000382
71                                   TonicImmobility  0.000415
104                                   LGIlabiaminora  0.000514
106                             LGIperihymenaltissue  0.000557
77                      SUSPECTmouthcontactOTHERsite  0.000761
63                                       Suspectrace  0.000817
145                   SwabToDNAanalysisOtherClothing  0.000886
110                                      LGIperineum  0.000949
78                           SuspectINJUREDbypatient  0.001152
137                             RandomSample20142015  0.001196
102                          LGIclitoralhoodclitoris  0.001518
15                            CurrentPhysicalmedprob  0.001723
28                               MedProbNeurological  0

**The table above shows only features whose p-values were above our Bonferroni cutoff, so they were not found to be significantly associated with our predicted feature**

###3. Table of Significant Features

In [None]:
#@title Significant Features
#Display P-Values of all significant values
pd.set_option("display.max_rows", None, "display.max_columns", None)
sigTable = pd.DataFrame(list(sigValues.items())) 

sigTable = sigTable.rename(columns={0: "Feature", 1: "P-Value"})
sigTable = sigTable.sort_values(by=['P-Value'])

print(sigTable)
print("Number of Significant Features: " + str(len(sigTable)))
sigTable.to_excel("sig.xlsx")

                                  Feature        P-Value
88                    ProfileofSTRDNAloci   0.000000e+00
97             Swab1LocationSTRDNAprofile   0.000000e+00
94                        CODISprofileHit   0.000000e+00
102                      CODISSDISreasons   0.000000e+00
101                      CODISNDISreasons   0.000000e+00
87         ProbableSTRDNAprofileOFsuspect   0.000000e+00
93              STRDNAProbableprofileTYPE   0.000000e+00
95                          STRDNAkitUsed  2.758086e-288
98             Swab2LocationSTRDNAprofile  2.958663e-272
79                             DNAKitUsed  1.152716e-258
83                      QuantMaleSwabLoc2  4.306772e-216
82                      QuantMaleSwabLoc1  4.844051e-185
104              SwabToDNAanalysisVaginal  8.382494e-171
81                      QuantMaleDNAFound  2.799993e-163
106             SwabToDNAanalysisPerianal  3.065993e-159
103  SwabToDNAanalysisNoquantmaleDNAfound  1.721807e-150
84                      QuantMa

**This table shows only significant features that we would likely use for further analysis of our dataset**

###We can dig deeper into the features we found significant by performing [Ad Hoc testing.](https://colab.research.google.com/drive/16IJCZW8NUS6oFwhgAjpKhdl1z_nYG2rL#scrollTo=BU6_9mNwc94H) This will help us determine which categories within each significant feature contribute to the feature's low p-value.
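
One common way to see which categories drive a low p-value is to inspect the Pearson residual of each cell in the contingency table. The sketch below uses hypothetical counts and is only an illustration of the idea; it is not necessarily the method used in the linked notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts: three feature categories vs. a binary outcome
observed = pd.DataFrame(
    [[90, 10], [50, 50], [48, 52]],
    index=["cat1", "cat2", "cat3"],
    columns=["0", "1"],
)

chi2, p, dof, expected = chi2_contingency(observed)

# Pearson residual per cell: (observed - expected) / sqrt(expected).
# Cells with large absolute residuals deviate most from independence.
residuals = (observed - expected) / np.sqrt(expected)

print(residuals.round(2))
```

In this toy table the `cat1` row has the largest residuals, so it contributes most to the overall chi-square statistic.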

##Results of Numerical Features
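As seen in the table below, not every numerical feature was significant in obtaining an eligible profile. The ten features from NumberOfswabsDNAanalysis through NumberOfYSTRDNAloci have p-values far below our Bonferroni cutoff, while the remaining six (Numberofgentialinjuries, TimeBetweenCollectAndDNAext, TimeBetweenSubmissionANDtesting, PainLevel, MulitipleSuspectNumber, and Numberofphysicalinjuries) do not pass it.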

In [None]:
#@title All Numerical Features Sorted by Increasing P-Value
#Prints P-Values for all features (both significant and insignificant)
pd.set_option("display.max_rows", None, "display.max_columns", None)
anovaTable = pd.DataFrame(list(anovaValue.items())) 

anovaTable = anovaTable.rename(columns={0: "Feature", 1: "P-Value"})
anovaTable = anovaTable.sort_values(by=['P-Value'])

print(anovaTable)

                             Feature        P-Value
12          NumberOfswabsDNAanalysis   0.000000e+00
14        NumberOFswabsSTRDNAprofile   0.000000e+00
11         NumberOfswabsQuantMaleDNA  1.380887e-276
1   Timebetweenassaultandexaminhours   3.556378e-42
13                NumberofSTRDNAloci   1.223560e-38
4           NumberofUnknownresponses   5.072397e-21
8                NumberOFitemsTested   1.836985e-11
0                                Age   1.114992e-09
5               NumberAssaultiveActs   3.875028e-08
15               NumberOfYSTRDNAloci   4.021639e-07
7            Numberofgentialinjuries   8.459960e-04
9        TimeBetweenCollectAndDNAext   1.153214e-02
10   TimeBetweenSubmissionANDtesting   8.411413e-02
2                          PainLevel   1.079722e-01
3             MulitipleSuspectNumber   7.300534e-01
6           Numberofphysicalinjuries   9.954010e-01
