## Metadata
1. ID: Unique identifier for each entry. (Important)
2. TargetID: Identifier for the target.
3. DRUGID: Unique identifier for the drug.
4. DRUGTYPE: Type of drug (e.g., small molecular drug).
5. Drug_high_status: Approval status of the drug (e.g., Approved).
6. DRUGNAME: Name of the drug.
7. PUBCHCID: PubChem compound ID.
8. Disease_of_highest_status: Disease for which the drug reached its highest status.
9. Drug_Status: Status of the drug in the dataset.
10. UNIPROID: UniProt identifier for the target protein.
11. TARGNAME: Name of the target.
12. GENENAME: Gene name associated with the target.
13. SYNONYMS: Alternate names for the target.
14. FUNCTION: Biological function of the target.
15. BIOCLASS: Biological classification of the target.
16. SEQUENCE: Amino acid sequence of the target protein.
17. Disease: Disease associated with the drug. (Important)
18. Accession Number: Accession number of the target protein (UniProt-style, e.g., Q13936).
19. Target_Status: Current status of the target (e.g., Approved).
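
The column list above can double as a quick schema check once the CSV is loaded. The sketch below is illustrative only: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names that are not part of the notebook, and the toy frame holds made-up values rather than rows from `train.csv`.

```python
import pandas as pd

# Column names copied from the metadata list above.
EXPECTED_COLUMNS = [
    "ID", "TargetID", "DRUGID", "DRUGTYPE", "Drug_high_status", "DRUGNAME",
    "PUBCHCID", "Disease_of_highest_status", "Drug_Status", "UNIPROID",
    "TARGNAME", "GENENAME", "SYNONYMS", "FUNCTION", "BIOCLASS", "SEQUENCE",
    "Disease", "Accession Number", "Target_Status",
]


def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df, in list order."""
    return [col for col in expected if col not in df.columns]


# Toy frame (made-up values) that contains only two of the expected columns.
toy = pd.DataFrame({"ID": [1], "Disease": ["example"]})
print(missing_columns(toy))
```

On the real data, `missing_columns(drug_df)` returning an empty list would indicate that the file matches the list above.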

In [22]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.preprocessing import LabelEncoder, StandardScaler
%matplotlib inline


In [3]:
drug_df = pd.read_csv("train.csv")

In [4]:
drug_df.head(20)

Unnamed: 0,ID,TargetID,DRUGID,DRUGTYPE,Drug_high_status,DRUGNAME,PUBCHCID,Disease_of_highest_status,Drug_Status,UNIPROID,TARGNAME,GENENAME,SYNONYMS,FUNCTION,BIOCLASS,SEQUENCE,Disease,Accession Number,Target_Status
0,140736,T51115,D0L4YD,Small molecular drug,Approved,Solifenacin,154059,Overactive bladder,Approved,CAC1C_HUMAN,Voltage-gated calcium channel alpha Cav1.2 (CA...,CACNA1C,Voltage-gated calcium channel subunit alpha Ca...,Mediates influx of calcium ions into the cytop...,Voltage-gated ion channel,MVNENTRMYIPEENHQGSNYGSPRPAHANMNANAAAGLAPEHIPTP...,Genetic cardiac arrhythmia,Q13936,Terminated
1,133048,T60529,D03NMM,Small molecular drug,Investigative,AM-643,46843035,Dermatological disease,Investigative,PGH1_HUMAN,Prostaglandin G/H synthase 1 (COX-1),PTGS1,Prostaglandin-endoperoxide synthase 1; Prostag...,Converts arachidonate to prostaglandin H2 (PGH...,Paired donor oxygen oxidoreductase,MSRSLLLWFLLFLLLLPPLPVLLADPGAPTPVNPCCYYPCQHQGIC...,Rheumatoid arthritis,P23219,Approved
2,60493,T80975,D0T2ER,Small molecular drug,Phase 1,TAK-593,24767976,Solid tumour/cancer,Phase 1,VGFR2_HUMAN,Vascular endothelial growth factor receptor 2 ...,KDR,VEGFR2; VEGFR-2; VEGF-2 receptor; Protein-tyro...,Plays an essential role in the regulation of a...,Kinase,MQSKVLLAVALWLCVETRAASVGLPSVSLDLPRLSIQKDILTIKAN...,Renal cell carcinoma,P35968,Approved
3,169176,T92072,D07ESH,Small molecular drug,Discontinued in Phase 3,PF-1913539,176408,Alzheimer disease,Discontinued in Phase 3,AA1R_HUMAN,Adenosine A1 receptor (ADORA1),ADORA1,Adenosine receptor A1; A(1) adenosine receptor,The activity of this receptor is mediated by G...,GPCR rhodopsin,MPPSISAFQAAYIGIEVLIALVSVPGNVLVIWAVKVNQALRDATFC...,Hyper-lipoproteinaemia,P30542,Phase 2
4,120183,T30082,D0Q0RC,Small molecular drug,Approved,Ethopropazine,3290,Parkinson disease,Approved,ACES_HUMAN,Acetylcholinesterase (AChE),ACHE,YT; N-ACHE; ARACHE,Role in neuronal apoptosis. Terminates signal ...,Carboxylic ester hydrolase,MRPPQCLLHTPSLASPLLLLLLWLLGGGVGAEGREDAELLVTVRGG...,Oesophageal/gastroduodenal disorder,P22303,Approved
5,137655,T75243,D0W7HE,Small molecular drug,Approved,Alpelisib,56649450,Solid tumour/cancer,Phase 2,MTOR_HUMAN,Serine/threonine-protein kinase mTOR (mTOR),MTOR,Target of rapamycin; TOR kinase; Rapamycin tar...,MTOR directly or indirectly regulates the phos...,Kinase,MLGTGPAAATTAATTSSNVSVLQQFASGLKSRNEETRAKAAKELQH...,Bladder cancer,P42345,Phase 1/2
6,86935,T85435,D0O0LS,Small molecular drug,Approved,Entrectinib,25141092,Solid tumour/cancer,Phase 2,INSR_HUMAN,Insulin receptor (INSR),INSR,IR; CD220 antigen; CD220,Binding of insulin leads to phosphorylation of...,Kinase,MATGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLTR...,Acute diabete complication,P06213,Approved
7,96089,T40954,D0N0ES,Small molecular drug,Approved,Vitamin B3,938,Lipid metabolism disorder,Approved,XDH_HUMAN,Xanthine dehydrogenase/oxidase (XDH),XDH,Xanthine oxidase; Xanthine dehydrogenase; XDHA,Catalyzes the oxidation of hypoxanthine to xan...,CH/CH(2) oxidoreductase,MTADKLVFFVNGRKVVEKNADPETTLLAYLRRKLGLSGTKLGCGEG...,Encephalopathy,P47989,Phase 2
8,142033,T21945,D04PXS,Small molecular drug,Phase 3,Amitifadine,11658655,Mood disorder,Phase 2,SC6A2_HUMAN,Norepinephrine transporter (NET),SLC6A2,Solute carrier family 6 member 2; Sodium-depen...,Amine transporter. Terminates the action of no...,Neurotransmitter:sodium symporter,MLLARMNPQVQPENNGADTGPEQPLRARKTAELLVVKERNGVQCLL...,Pain,P23975,Approved
9,25065,T78709,D0P2QO,Small molecular drug,Phase 2,S-15535,132787,Anxiety disorder,Phase 2,5HT1A_HUMAN,5-HT 1A receptor (HTR1A),HTR1A,Serotonin receptor 1A; G-21; ADRBRL1; ADRB2RL1...,Functions as a receptor for various drugs and ...,GPCR rhodopsin,MDVLSPGQGNNTTSPPAPFETGGNTTGISDVTVSYQVITSLLLGTL...,Attention deficit hyperactivity disorder,P08908,Phase 3


In [5]:
drug_df.shape

(134486, 19)

In [6]:
drug_df.describe()

Unnamed: 0,ID,PUBCHCID
count,134486.0,134486.0
mean,95958.384107,13973080.0
std,55524.64534,25618850.0
min,2.0,119.0
25%,47764.25,54841.0
50%,96015.5,5280980.0
75%,144131.75,11658860.0
max,192122.0,135743700.0


No null values are present (see the non-null counts from `info()` below).
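
Since `describe()` only summarizes the numeric columns, a direct count of missing entries per column is one way to back up this claim. The following is a minimal sketch: `columns_with_nulls` is a hypothetical helper, and the toy frame uses made-up values with one deliberately missing entry.

```python
import pandas as pd


def columns_with_nulls(df):
    """Return a Series of null counts, restricted to columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0]


# Toy frame with made-up values; one entry in "GENENAME" is missing.
toy = pd.DataFrame({
    "DRUGID": ["D1", "D2", "D3"],
    "GENENAME": ["KDR", None, "ACHE"],
})
print(columns_with_nulls(toy))
```

An empty result from `columns_with_nulls(drug_df)` would agree with the statement above. Note that `isna()` does not flag placeholder strings such as an empty string, so those would need a separate check.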

In [7]:
drug_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 134486 entries, 0 to 134485
Data columns (total 19 columns):
 #   Column                     Non-Null Count   Dtype 
---  ------                     --------------   ----- 
 0   ID                         134486 non-null  int64 
 1   TargetID                   134486 non-null  object
 2   DRUGID                     134486 non-null  object
 3   DRUGTYPE                   134486 non-null  object
 4   Drug_high_status           134486 non-null  object
 5   DRUGNAME                   134486 non-null  object
 6   PUBCHCID                   134486 non-null  int64 
 7   Disease_of_highest_status  134486 non-null  object
 8   Drug_Status                134486 non-null  object
 9   UNIPROID                   134486 non-null  object
 10  TARGNAME                   134486 non-null  object
 11  GENENAME                   134486 non-null  object
 12  SYNONYMS                   134486 non-null  object
 13  FUNCTION                   134486 non-null  

In [8]:
drug_df.columns

Index(['ID', 'TargetID', 'DRUGID', 'DRUGTYPE', 'Drug_high_status', 'DRUGNAME',
       'PUBCHCID', 'Disease_of_highest_status', 'Drug_Status', 'UNIPROID',
       'TARGNAME', 'GENENAME', 'SYNONYMS', 'FUNCTION', 'BIOCLASS', 'SEQUENCE',
       'Disease', 'Accession Number', 'Target_Status'],
      dtype='object')

In [10]:
drug_df['DRUGTYPE'].value_counts()

DRUGTYPE
Small molecular drug                       134082
Small molecule immunotherapy                  192
Combination drug (small molecular drug)       174
Combination drug (Antibody)                    24
Protein/peptide drug                           12
Recombinant protein                             2
Name: count, dtype: int64
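
The counts above are heavily skewed toward one class. As a rough illustration, the sketch below rebuilds the printed counts as a Series (the variable name `drugtype_counts` is an assumption, not from the notebook) and converts them to proportions; small molecular drugs account for about 99.7% of rows.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
drugtype_counts = pd.Series({
    "Small molecular drug": 134082,
    "Small molecule immunotherapy": 192,
    "Combination drug (small molecular drug)": 174,
    "Combination drug (Antibody)": 24,
    "Protein/peptide drug": 12,
    "Recombinant protein": 2,
})

# Share of rows per drug type; the counts sum to the 134486 rows of the frame.
shares = drugtype_counts / drugtype_counts.sum()
print(shares.round(4))
```

On the loaded frame, `drug_df['DRUGTYPE'].value_counts(normalize=True)` should give the same proportions directly.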

There are 2443 unique drug IDs (see `Length: 2443` in the output below).

In [30]:
drug_df['DRUGID'].value_counts()


DRUGID
D0O0LS    1997
D0W5HK    1601
D0H3HM    1439
D0AZ3C    1368
D0U3EP    1290
          ... 
D0Z2FD       1
D0B2OT       1
D09VBE       1
D03XYW       1
D01AQT       1
Name: count, Length: 2443, dtype: int64

In [33]:
drug_df['TargetID'].value_counts()

TargetID
T67162    6538
T59328    4037
T95913    4007
T78709    2853
T80975    2842
          ... 
T21678       1
T24634       1
T43206       1
T69506       1
T96685       1
Name: count, Length: 701, dtype: int64
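
The two `value_counts()` outputs above give both the number of distinct IDs (2443 drugs, 701 targets) and a sense of how concentrated the rows are: the most frequent target, T67162, covers 6538 of 134486 rows (about 4.9%), and the most frequent drug, D0O0LS, covers 1997 rows (about 1.5%). A small helper can report both figures at once. This is a sketch only: `id_summary` is a hypothetical name and the toy frame contains made-up IDs.

```python
import pandas as pd


def id_summary(df, column):
    """Return (number of distinct values, share of rows held by the most frequent value)."""
    counts = df[column].value_counts()  # sorted by count, descending
    return counts.size, counts.iloc[0] / len(df)


# Toy frame with made-up IDs.
toy = pd.DataFrame({
    "DRUGID": ["D1", "D1", "D2", "D3"],
    "TargetID": ["T1", "T1", "T1", "T2"],
})
for col in ("DRUGID", "TargetID"):
    n_unique, top_share = id_summary(toy, col)
    print(col, n_unique, round(top_share, 2))
```

Applied to `drug_df`, the first element of each result should match the `Length` values printed above.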