<a href="https://colab.research.google.com/github/Madhurika1292/Medicines-and-Common-Treatment-Recommendation-System/blob/main/Drugscom_data_exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Drugs.com Data Exploration

Description - 

Attribute Information:

1. drugName (categorical): name of drug
2. condition (categorical): name of condition
3. review (text): patient review
4. rating (numerical): patient rating on a 10-star scale
5. date (date): date of review entry
6. usefulCount (numerical): number of users who found review useful

In [46]:
#Loading necessary libraries
import pandas as pd
import numpy as np
import plotly.express as px



In [None]:
#Cloning Git repository
!git clone https://github.com/Madhurika1292/Medicines-and-Common-Treatment-Recommendation-System.git

Cloning into 'Medicines-and-Common-Treatment-Recommendation-System'...
remote: Enumerating objects: 28, done.
remote: Counting objects: 100% (28/28), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 28 (delta 5), reused 16 (delta 2), pack-reused 0
Unpacking objects: 100% (28/28), done.


In [None]:
!ls Medicines-and-Common-Treatment-Recommendation-System/Drugscom

drugsComTest_raw.tsv  drugsComTrain_raw.tsv


## Data Loading

In [None]:
#Loading Data
Drugscom_train = pd.read_csv('Medicines-and-Common-Treatment-Recommendation-System/Drugscom/drugsComTrain_raw.tsv',sep='\t',parse_dates=['date'])
Drugscom_test = pd.read_csv('Medicines-and-Common-Treatment-Recommendation-System/Drugscom/drugsComTest_raw.tsv',sep='\t',parse_dates=['date'])


In [None]:
#Data set shape
print("Drugscom Train shape :" ,Drugscom_train.shape)
print("Drugscom Test shape :", Drugscom_test.shape)

Train shape : (161297, 7)
Test shape : (53766, 7)


In [None]:
#Data information
print("Training Data information :")
Drugscom_train.info()


Training Data information :
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161297 entries, 0 to 161296
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   Unnamed: 0   161297 non-null  int64         
 1   drugName     161297 non-null  object        
 2   condition    160398 non-null  object        
 3   review       161297 non-null  object        
 4   rating       161297 non-null  float64       
 5   date         161297 non-null  datetime64[ns]
 6   usefulCount  161297 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(2), object(3)
memory usage: 8.6+ MB


In [None]:
print("Test Data information :")
Drugscom_test.info()

Test Data information :
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53766 entries, 0 to 53765
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Unnamed: 0   53766 non-null  int64         
 1   drugName     53766 non-null  object        
 2   condition    53471 non-null  object        
 3   review       53766 non-null  object        
 4   rating       53766 non-null  float64       
 5   date         53766 non-null  datetime64[ns]
 6   usefulCount  53766 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(2), object(3)
memory usage: 2.9+ MB
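
The two info() summaries above show that condition is the only column with missing entries: 160398 of 161297 rows are non-null in the training set (899 missing) and 53471 of 53766 in the test set (295 missing). A minimal sketch of how such gaps could be counted directly is given below. It runs on a small made-up frame named toy, not on the real data, so the printed numbers are illustrative only.

```python
import pandas as pd

# Toy frame standing in for the review data (values are made up for illustration)
toy = pd.DataFrame({
    "drugName": ["DrugA", "DrugB", "DrugC", "DrugD"],
    "condition": ["Pain", None, "Acne", None],
    "rating": [9.0, 8.0, 5.0, 7.0],
})

# Number of missing values in each column
missing = toy.isna().sum()

# Fraction of rows that are missing in each column
missing_share = toy.isna().mean()

print(missing)
print(missing_share)
```

Applied to Drugscom_train or Drugscom_test, the same two calls would report the missing conditions without reading them off the info() table by hand.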


In [None]:
#Renaming the "Unnamed: 0" column to uniqueId, as it holds a unique identifier for each row
Drugscom_train=Drugscom_train.rename(columns={'Unnamed: 0' : 'uniqueId'})
Drugscom_test=Drugscom_test.rename(columns={'Unnamed: 0' : 'uniqueId'})
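
For context, pandas assigns the name "Unnamed: 0" when the first header field of a delimited file is empty. The sketch below reproduces that behaviour on a tiny in-memory TSV and then applies the same rename. The file contents, the drug names, and the variable names raw and df are assumptions made for illustration; the real files may be laid out differently.

```python
import io
import pandas as pd

# Made-up TSV whose header starts with an empty field (an unlabeled first column)
raw = (
    "\tdrugName\tcondition\trating\tdate\n"
    "101\tDrugA\tPain\t9.0\t2012-05-20\n"
    "102\tDrugB\tAcne\t5.0\t2009-12-14\n"
)

# Read the tab-separated text and parse the date column into datetimes
df = pd.read_csv(io.StringIO(raw), sep="\t", parse_dates=["date"])
print(df.columns.tolist())  # the unlabeled column is listed first

# Give the unlabeled column a descriptive name
df = df.rename(columns={"Unnamed: 0": "uniqueId"})
print(df.columns.tolist())
```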

In [None]:
print("Drugscom Train columns :" ,Drugscom_train.columns)
print("Drugscom Test columns :", Drugscom_test.columns)

Drugscom Train columns : Index(['uniqueId', 'drugName', 'condition', 'review', 'rating', 'date',
       'usefulCount'],
      dtype='object')
Drugscom Test columns : Index(['uniqueId', 'drugName', 'condition', 'review', 'rating', 'date',
       'usefulCount'],
      dtype='object')


## Data understanding

### Checking whether a customer has written multiple reviews for a particular medicine

In [None]:
print("Unique IDs in Drugscom training set : " ,len(set(Drugscom_train['uniqueId'].values)))
print("Total length of Drugscom training set  : " ,Drugscom_train.shape[0])

Unique IDs in Drugscom training set :  161297
Total length of Drugscom training set  :  161297


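Every uniqueId in the training set is distinct, so no ID is repeated across reviews. If uniqueId identifies the reviewer, it appears that each customer has written only one review.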
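
The comparison of the unique count with the row count can also be expressed with pandas' own duplicate helpers. The following is a small sketch on made-up data (the frame toy and its values are hypothetical) that shows how a repeated ID would be detected and listed, which the length comparison alone does not do.

```python
import pandas as pd

# Toy data with one deliberately repeated ID (values are made up)
toy = pd.DataFrame({
    "uniqueId": [101, 102, 103, 103],
    "drugName": ["DrugA", "DrugB", "DrugC", "DrugC"],
})

# True only when no ID value occurs more than once
ids_unique = toy["uniqueId"].is_unique

# All rows whose ID occurs more than once (keep=False marks every copy)
repeated = toy[toy["uniqueId"].duplicated(keep=False)]

print(ids_unique)
print(repeated)
```

On the training set, Drugscom_train['uniqueId'].is_unique would be expected to be True, given the counts printed above.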

### Checking the number of drugs for each condition

For this analysis, I have combined the training set and testing set.

In [31]:
#Combining training and testing data (ignore_index avoids repeated row labels)
Drugscom_combined=pd.concat([Drugscom_train,Drugscom_test],ignore_index=True)

In [34]:
Drugscom_combined.head()

|   | uniqueId | drugName | condition | review | rating | date | usefulCount |
|---|---|---|---|---|---|---|---|
| 0 | 206461 | Valsartan | Left Ventricular Dysfunction | "It has no side effect, I take it in combinati... | 9.0 | 2012-05-20 | 27 |
| 1 | 95260 | Guanfacine | ADHD | "My son is halfway through his fourth week of ... | 8.0 | 2010-04-27 | 192 |
| 2 | 92703 | Lybrel | Birth Control | "I used to take another oral contraceptive, wh... | 5.0 | 2009-12-14 | 17 |
| 3 | 138000 | Ortho Evra | Birth Control | "This is my first time using any form of birth... | 8.0 | 2015-11-03 | 10 |
| 4 | 35696 | Buprenorphine / naloxone | Opiate Dependence | "Suboxone has completely turned my life around... | 9.0 | 2016-11-27 | 37 |


In [35]:
Drugscom_combined.shape

(215063, 7)
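
The combined row count matches the sum of the two parts (161297 + 53766 = 215063). Once the frames are stacked, however, nothing records which rows came from which file. A minimal sketch of one way to keep that information is shown below, on made-up frames. The split column and the variables train, test, and combined are hypothetical names introduced here, not part of the original notebook.

```python
import pandas as pd

# Tiny stand-ins for the training and test frames (values are made up)
train = pd.DataFrame({"uniqueId": [1, 2], "condition": ["Pain", "Acne"]})
test = pd.DataFrame({"uniqueId": [3], "condition": ["Pain"]})

# Tag each row with its origin, then stack the frames with a fresh 0..n-1 index
combined = pd.concat(
    [train.assign(split="train"), test.assign(split="test")],
    ignore_index=True,
)

print(combined)
```

Keeping the origin as a column makes it possible to separate the two sets again later without reloading the files.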

In [43]:
#To check the number of drugs per condition, I have grouped the data by "condition" column

Drugscom_conditon=pd.DataFrame(Drugscom_combined.groupby(['condition'])['drugName'].nunique().sort_values(ascending=False))

#resetting index
Drugscom_conditon.reset_index(inplace=True)

Drugscom_conditon.head(10)

|   | condition | drugName |
|---|---|---|
| 0 | Not Listed / Othe | 253 |
| 1 | Pain | 219 |
| 2 | Birth Control | 181 |
| 3 | High Blood Pressure | 146 |
| 4 | Acne | 127 |
| 5 | Depression | 115 |
| 6 | Rheumatoid Arthritis | 107 |
| 7 | Diabetes, Type 2 | 97 |
| 8 | Allergic Rhinitis | 95 |
| 9 | Insomnia | 85 |
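
To make the grouping step concrete, here is a small sketch of the same groupby and nunique pattern on made-up data. It shows that repeated drug names within a condition are counted once and that rows with a missing condition are left out by groupby's default behaviour. The frame toy, the result name drugs_per_condition, and the drugCount column are hypothetical names chosen for this example.

```python
import pandas as pd

# Toy reviews: DrugA is reviewed twice for Pain, and one row has no condition
toy = pd.DataFrame({
    "condition": ["Pain", "Pain", "Pain", "Acne", "Acne", None],
    "drugName": ["DrugA", "DrugB", "DrugA", "DrugC", "DrugC", "DrugD"],
})

# Count distinct drug names per condition, largest first
drugs_per_condition = (
    toy.groupby("condition")["drugName"]
    .nunique()
    .sort_values(ascending=False)
    .reset_index(name="drugCount")
)

print(drugs_per_condition)
```

Naming the count column explicitly avoids a column called drugName that actually holds counts, as in the table above.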


In [53]:
#Visualizing the 10 conditions with the largest number of distinct drugs
fig = px.bar(
    Drugscom_conditon[:10], x='condition', y='drugName', color='drugName',
    labels={'drugName':'Count of Drugs'}, height=500, width=1500,
    title="Top 10 Conditions by Number of Available Drugs",
)
fig.show()