
# Druglib.com Data Exploration

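Description: patient drug reviews from Druglib.com, provided as separate training and test sets.

Attribute Information:

urlDrugName (categorical): name of drug
condition (categorical): name of condition
benefitsReview (text): patient review of the drug's benefits
sideEffectsReview (text): patient review of the drug's side effects
commentsReview (text): overall patient comment
rating (numerical): 10 star patient rating
sideEffects (categorical): side-effect severity reported by the patient (e.g. "Mild Side Effects")
effectiveness (categorical): effectiveness reported by the patient (e.g. "Highly Effective")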

In [1]:
# Loading necessary libraries
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import style
from wordcloud import WordCloud

style.use('ggplot')
%matplotlib inline


In [2]:
#Cloning Git repository
!git clone https://github.com/Madhurika1292/Medicines-and-Common-Treatment-Recommendation-System.git

Cloning into 'Medicines-and-Common-Treatment-Recommendation-System'...
remote: Enumerating objects: 91, done.
remote: Counting objects: 100% (91/91), done.
remote: Compressing objects: 100% (77/77), done.
remote: Total 91 (delta 45), reused 28 (delta 8), pack-reused 0
Unpacking objects: 100% (91/91), done.
Checking out files: 100% (13/13), done.


In [3]:
!ls Medicines-and-Common-Treatment-Recommendation-System/DrugsLib


drugLibTest_raw.tsv  drugLibTrain_raw.tsv


## Data Loading

In [4]:
#Loading Data
Drugslib_train = pd.read_csv('Medicines-and-Common-Treatment-Recommendation-System/DrugsLib/drugLibTrain_raw.tsv',sep='\t')
Drugslib_test = pd.read_csv('Medicines-and-Common-Treatment-Recommendation-System/DrugsLib/drugLibTest_raw.tsv',sep='\t')

In [5]:
#Data set shape
print("DrugsLib Train shape :" ,Drugslib_train.shape)
print("DrugsLib Test shape :", Drugslib_test.shape)

DrugsLib Train shape : (3107, 9)
DrugsLib Test shape : (1036, 9)


In [6]:
#Data information
print("Training Data information :")
Drugslib_train.info()

Training Data information :
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3107 entries, 0 to 3106
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Unnamed: 0         3107 non-null   int64 
 1   urlDrugName        3107 non-null   object
 2   rating             3107 non-null   int64 
 3   effectiveness      3107 non-null   object
 4   sideEffects        3107 non-null   object
 5   condition          3106 non-null   object
 6   benefitsReview     3107 non-null   object
 7   sideEffectsReview  3105 non-null   object
 8   commentsReview     3099 non-null   object
dtypes: int64(2), object(7)
memory usage: 218.6+ KB


In [7]:
print("Test Data information :")
Drugslib_test.info()

Test Data information :
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1036 entries, 0 to 1035
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Unnamed: 0         1036 non-null   int64 
 1   urlDrugName        1036 non-null   object
 2   rating             1036 non-null   int64 
 3   effectiveness      1036 non-null   object
 4   sideEffects        1036 non-null   object
 5   condition          1036 non-null   object
 6   benefitsReview     1036 non-null   object
 7   sideEffectsReview  1036 non-null   object
 8   commentsReview     1036 non-null   object
dtypes: int64(2), object(7)
memory usage: 73.0+ KB
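
The `info()` output above shows a few missing entries in the training set (1 in `condition`, 2 in `sideEffectsReview`, 8 in `commentsReview`) and none in the test set. A minimal sketch of how those gaps could be listed directly, using a made-up miniature frame (`sample` and its values are hypothetical stand-ins for `Drugslib_train`):

```python
import pandas as pd

# Hypothetical miniature frame standing in for Drugslib_train; values are made up.
sample = pd.DataFrame({
    "urlDrugName": ["drug-a", "drug-b", "drug-c"],
    "condition": ["acid reflux", None, "fibromyalgia"],
    "commentsReview": ["took daily", None, None],
})

# Count missing entries per column, keeping only columns that have any.
missing = sample.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

The same two lines should work unchanged on the real frames, which makes it easy to decide whether to drop or fill the affected rows before text analysis.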


In [8]:
# Renaming the "Unnamed: 0" column to uniqueId, as it appears to hold a unique id for each review row
Drugslib_train=Drugslib_train.rename(columns={'Unnamed: 0' : 'uniqueId'})
Drugslib_test=Drugslib_test.rename(columns={'Unnamed: 0' : 'uniqueId'})

In [9]:
print("DrugsLib Train columns :", Drugslib_train.columns)
print("DrugsLib Test columns :", Drugslib_test.columns)

DrugsLib Train columns : Index(['uniqueId', 'urlDrugName', 'rating', 'effectiveness', 'sideEffects',
       'condition', 'benefitsReview', 'sideEffectsReview', 'commentsReview'],
      dtype='object')
DrugsLib Test columns : Index(['uniqueId', 'urlDrugName', 'rating', 'effectiveness', 'sideEffects',
       'condition', 'benefitsReview', 'sideEffectsReview', 'commentsReview'],
      dtype='object')


## Data Understanding

### Checking the number of Drugs for each condition
For this analysis, I have combined the training set and testing set.

In [10]:
#Combining training and testing data
DrugsLib_combined=pd.concat([Drugslib_train,Drugslib_test])

In [11]:
DrugsLib_combined.head()


| | uniqueId | urlDrugName | rating | effectiveness | sideEffects | condition | benefitsReview | sideEffectsReview | commentsReview |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2202 | enalapril | 4 | Highly Effective | Mild Side Effects | management of congestive heart failure | slowed the progression of left ventricular dys... | cough, hypotension , proteinuria, impotence , ... | monitor blood pressure , weight and asses for ... |
| 1 | 3117 | ortho-tri-cyclen | 1 | Highly Effective | Severe Side Effects | birth prevention | Although this type of birth control has more c... | Heavy Cycle, Cramps, Hot Flashes, Fatigue, Lon... | I Hate This Birth Control, I Would Not Suggest... |
| 2 | 1146 | ponstel | 10 | Highly Effective | No Side Effects | menstrual cramps | I was used to having cramps so badly that they... | Heavier bleeding and clotting than normal. | I took 2 pills at the onset of my menstrual cr... |
| 3 | 3947 | prilosec | 3 | Marginally Effective | Mild Side Effects | acid reflux | The acid reflux went away for a few months aft... | Constipation, dry mouth and some mild dizzines... | I was given Prilosec prescription at a dose of... |
| 4 | 1951 | lyrica | 2 | Marginally Effective | Severe Side Effects | fibromyalgia | I think that the Lyrica was starting to help w... | I felt extremely drugged and dopey. Could not... | See above |


In [12]:
DrugsLib_combined.shape


(4143, 9)
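
By default `pd.concat` keeps each frame's original index, so the combined frame of 3107 + 1036 = 4143 rows carries repeated index labels and no record of which split a row came from. The sketch below, on two hypothetical toy frames, shows one way to keep the origin and get a clean index; the `split` column name is an assumption, not something the notebook defines.

```python
import pandas as pd

# Hypothetical toy frames standing in for Drugslib_train / Drugslib_test.
train = pd.DataFrame({"urlDrugName": ["drug-a", "drug-b"], "rating": [4, 10]})
test = pd.DataFrame({"urlDrugName": ["drug-c"], "rating": [7]})

# Default concat keeps the original labels, so label 0 appears twice.
stacked = pd.concat([train, test])
print(stacked.index.is_unique)

# Tag the source split (illustrative column name), then rebuild a 0..n-1 index.
combined = pd.concat(
    [train.assign(split="train"), test.assign(split="test")],
    ignore_index=True,
)
print(combined)
```

A unique index matters mostly for label-based lookups with `.loc`; the group-by aggregations used in the rest of this notebook are unaffected either way.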

### Number of drugs available for top conditions


In [24]:
analysis1 = DrugsLib_combined.groupby(['condition'])['urlDrugName'].nunique().sort_values(ascending = False).reset_index().head(30)

analysis1=analysis1.rename(columns={'urlDrugName':'Count of Drugs'})

fig = px.bar(analysis1, x='condition', y='Count of Drugs',
             hover_data=['condition', 'Count of Drugs'], color='Count of Drugs', title='Drugs Count for Top Conditions')
fig.show()
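
The `condition` column is free text (for example "management of congestive heart failure"), so the same condition may appear under several spellings and be counted as separate groups. A minimal sketch of normalizing the labels before grouping, on invented rows; the `condition_clean` column and the spelling variants are hypothetical:

```python
import pandas as pd

# Hypothetical rows; the spelling variants are invented for illustration.
reviews = pd.DataFrame({
    "urlDrugName": ["drug-a", "drug-b", "drug-c", "drug-a"],
    "condition": ["Acid Reflux", "acid reflux ", "acid  reflux", "depression"],
})

# Lower-case, trim, and collapse repeated whitespace in the free-text labels.
reviews["condition_clean"] = (
    reviews["condition"]
    .str.lower()
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
)

# Distinct drugs per normalized condition.
drugs_per_condition = (
    reviews.groupby("condition_clean")["urlDrugName"]
    .nunique()
    .sort_values(ascending=False)
)
print(drugs_per_condition)
```

This only handles case and whitespace; synonyms such as "GERD" versus "acid reflux" would need a manual mapping.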

### Number of conditions present per drug

In [33]:
analysis2 = DrugsLib_combined.groupby(['urlDrugName'])['condition'].nunique().sort_values(ascending = False).reset_index().head(30)

analysis2=analysis2.rename(columns={'urlDrugName':'Drug Name','condition':'Count of conditions'})
fig = px.bar(analysis2, x='Drug Name', y='Count of conditions',
            hover_data=['Drug Name', 'Count of conditions'], color='Count of conditions', title='Conditions present per Drug',color_continuous_scale=px.colors.sequential.Viridis)
fig.show()

### Most Common Conditions based on Reviews

In [34]:
analysis3 = DrugsLib_combined['condition'].value_counts().head(30).reset_index()
analysis3.columns = ['condition','count']


fig = px.bar(analysis3, x='condition', y='count',
            hover_data=['condition', 'count'], color='count', title='Most Common Conditions based on Reviews')
fig.show()

### Ratings Distribution

In [40]:
fig = px.histogram(DrugsLib_combined, x="rating",color_discrete_sequence=['indianred'],title='Ratings distribution')
fig.show()


In [41]:
# Average rating per drug
analysis4 = DrugsLib_combined.groupby('urlDrugName')['rating'].mean().reset_index()
fig = px.histogram(analysis4, x="rating", color_discrete_sequence=['indianred'], title='Distribution of average drug ratings')
fig.show()
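
An average computed from one or two reviews is noisy, so the histogram of per-drug means can be dominated by rarely reviewed drugs. The sketch below computes the mean together with the review count and keeps only drugs above a threshold; the toy data and the `min_reviews` value are assumptions for illustration, not choices made in this notebook.

```python
import pandas as pd

# Hypothetical ratings; drug names and values are made up.
reviews = pd.DataFrame({
    "urlDrugName": ["drug-a", "drug-a", "drug-a", "drug-b", "drug-c", "drug-c"],
    "rating": [8, 9, 7, 1, 10, 6],
})

# Mean rating and review count per drug in one pass.
per_drug = reviews.groupby("urlDrugName")["rating"].agg(["mean", "count"])

# min_reviews is an illustrative threshold, not a value from the notebook.
min_reviews = 2
reliable = per_drug[per_drug["count"] >= min_reviews]
print(reliable)
```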


### Distribution of review ratings

In [60]:
import plotly.graph_objects as go

# Computing the frequency of each rating
analysis5 = DrugsLib_combined['rating'].value_counts().reset_index()
analysis5.columns = ['rating','count']

# Making sure the rating labels are integers
analysis5 = analysis5.astype({'rating':'int'})

# Plotting user rating distribution
size = analysis5['count']
colors = ['salmon','violet','lightgreen','pink','wheat','azure','sienna','orange','turquoise','olive']
labels = analysis5['rating']

# Donut chart; the colors list above is applied to the slices
fig = go.Figure(data=[go.Pie(labels=labels, values=size, hole=.5, marker=dict(colors=colors))])
fig.update_layout(title_text="User ratings distribution",width=600,height=600)
fig.show()
# Reference: https://plotly.com/python/pie-charts/
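
The pie chart shows shares visually; the same proportions can also be read as numbers. A small sketch, assuming a hypothetical series of ratings on the 1 to 10 scale, that computes the percentage of reviews per rating ordered by rating value:

```python
import pandas as pd

# Hypothetical ratings on the 1-10 scale.
ratings = pd.Series([10, 10, 8, 1, 8, 10, 5, 1], name="rating")

# Share of each rating in percent, ordered by rating value rather than frequency.
share = ratings.value_counts(normalize=True).sort_index().mul(100).round(1)
print(share)
```

On the real data this would be applied to `DrugsLib_combined['rating']`.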