
#**Project Title: Exploratory Analysis of Indian Pharmaceuticals Dataset**

#**1. Introduction**

This EDA aims to uncover patterns, distributions, and insights from a dataset of 253,973 medicines available in India. The dataset includes key features like product ID, name, price, manufacturer, type, packaging, and composition details. Through data profiling, visualizations, and summary statistics, we explore product availability, pricing trends, formulation completeness, and market diversity in the Indian pharmaceutical space.

**Potential Hypotheses to Be Explored**

**1. Higher-priced medicines are more likely to be discontinued.**
(Test by comparing price distributions across the Is_discontinued flag.)

**2. Some manufacturers dominate the market with a higher number of products.**
(Test by analyzing the frequency distribution of manufacturer_name.)

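**3. Certain medicine types (e.g., tablets) are more common than others.**
(Test by checking the distribution of dosage forms. The type column holds a single value, allopathy, for every record, so the form would need to be derived from name or pack_size_label.)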

**4. Combination drugs (with short_composition2) tend to be more expensive.**
(Compare average price between entries with and without short_composition2.)

**5. Pack size is correlated with medicine price.**
(Extract numeric info from pack_size_label and check correlation with price(₹).)

**6. Discontinued medicines are more likely to come from lesser-known manufacturers.**
(Compare discontinuation rates across manufacturers.)
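
Hypotheses 1 and 4 both come down to comparing price between the two groups of a boolean flag. The sketch below shows one way to do that with `groupby`. The helper name `price_by_flag` is hypothetical, and the toy rows are invented placeholders that only reuse the dataset's column names; they are not real records. The median is reported alongside the mean because prices are likely to be skewed.

```python
import pandas as pd

# Toy frame with the dataset's column names; values are placeholders.
toy = pd.DataFrame({
    "price(₹)": [10.0, 20.0, 30.0, 200.0, 400.0],
    "Is_discontinued": [False, False, False, True, True],
    "short_composition2": [None, "B (5mg)", None, "C (10mg)", None],
})


def price_by_flag(df: pd.DataFrame, flag: pd.Series) -> pd.DataFrame:
    """Summarise price for the False and True groups of a boolean flag."""
    grouped = df.groupby(flag.rename("flag"))["price(₹)"]
    return grouped.agg(["count", "median", "mean"])


# Hypothesis 1: price by discontinuation status.
by_status = price_by_flag(toy, toy["Is_discontinued"])

# Hypothesis 4: a non-null short_composition2 marks a combination drug.
is_combo = toy["short_composition2"].notna()
by_combo = price_by_flag(toy, is_combo)

print(by_status)
print(by_combo)
```

On the real data, `toy` would be replaced by the loaded DataFrame; a difference in group medians is only descriptive and would still need a statistical test before drawing conclusions.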



# **2. Dataset Description:**

**id**: Unique identifier for each medicine entry.

**name**: Name of the medicine or drug product.

**price(₹)**: Price of the medicine in Indian Rupees.

**Is_discontinued**: Boolean flag indicating whether the medicine is discontinued (True) or still available (False).

**manufacturer_name**: Name of the company or brand that manufactures the medicine.

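**type**: System of medicine; every record in this dataset has the value allopathy. The dosage form (e.g., tablet, syrup, capsule) appears in name and pack_size_label instead.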

**pack_size_label**: Label describing the packaging size or quantity (e.g., "strip of 10 tablets", "bottle of 100 ml Syrup").

**short_composition1**: Primary active ingredient(s) in the medicine.

**short_composition2**: Secondary or additional active ingredient(s); may contain missing values.
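
Because pack_size_label is free text, any numeric pack size has to be parsed out of it. The sketch below is one possible approach, assuming labels follow the pattern "container of quantity unit" seen in the previewed rows; the regular expression and the group names `container`, `quantity`, and `unit` are assumptions, and labels that do not match would come back as NaN.

```python
import pandas as pd

# Labels copied from the rows previewed in this notebook.
labels = pd.Series(
    [
        "strip of 10 tablets",
        "strip of 5 tablets",
        "bottle of 100 ml Syrup",
        "vial of 1 Injection",
        "bottle of 75 gm Dusting Powder",
    ],
    name="pack_size_label",
)

# Assumed pattern: "<container> of <quantity> <unit or form>".
pattern = r"^(?P<container>\w+) of (?P<quantity>\d+(?:\.\d+)?)\s*(?P<unit>.+)$"

parsed = labels.str.extract(pattern)
parsed["quantity"] = pd.to_numeric(parsed["quantity"])

print(parsed)
```

The extracted quantities mix units (tablets, ml, gm), so a correlation with price is probably only meaningful within one container or unit group at a time.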

# **3. Import Required Libraries**

We import the Python libraries needed for data manipulation and visualization.

In [1]:
# Data Manipulation Libraries
import numpy as np
import pandas as pd

# Data Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Set a consistent theme for all plots
sns.set_theme(style='whitegrid')

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **4. Load The Dataset**

The dataset is loaded with pandas. The cell below reads the CSV from the Google Drive folder mounted in the previous step. If the CSV is instead hosted in a GitHub repository, passing its raw URL to `pd.read_csv` makes the notebook self-contained and removes the need to authorize and mount Drive on every run.

In [5]:
df = pd.read_csv('/content/drive/MyDrive/colab/eda/4. Pharma/MEDICINE DATASET/A_Z_medicines_dataset_of_India.csv')
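
Whatever the source, it can help to confirm right after loading that the file has the expected columns. The sketch below wraps `pd.read_csv` in a hypothetical helper, `load_medicines`, and checks the result against the nine column names this notebook reports. To stay self-contained it reads two previewed rows from an in-memory string instead of the real file.

```python
import io

import pandas as pd

# Column names as reported by df.columns in this notebook.
EXPECTED_COLUMNS = [
    "id", "name", "price(₹)", "Is_discontinued", "manufacturer_name",
    "type", "pack_size_label", "short_composition1", "short_composition2",
]


def load_medicines(source) -> pd.DataFrame:
    """Read the CSV (path, URL, or file-like) and verify its columns."""
    df = pd.read_csv(source)
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")
    return df


# Two rows from the preview, used here in place of the real file.
sample_csv = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    "1,Augmentin 625 Duo Tablet,223.42,False,"
    "Glaxo SmithKline Pharmaceuticals Ltd,allopathy,strip of 10 tablets,"
    "Amoxycillin (500mg),Clavulanic Acid (125mg)\n"
    "2,Azithral 500 Tablet,132.36,False,Alembic Pharmaceuticals Ltd,"
    "allopathy,strip of 5 tablets,Azithromycin (500mg),\n"
)

sample = load_medicines(sample_csv)
print(sample.dtypes)
```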

#**5. Initial Data Inspection**

To gain a foundational understanding of the dataset, we begin with an initial inspection that covers several essential aspects. This includes previewing the first few records to get a sense of the structure and values, examining the data types of each feature to ensure they align with expectations, and reviewing the overall shape and completeness of the dataset. We also generate statistical summaries for both numerical and categorical features to identify distributions, detect potential anomalies, and guide further steps in the analysis pipeline.

##**5.1 Preview a Few Records**

In [6]:
df.head()

| | id | name | price(₹) | Is_discontinued | manufacturer_name | type | pack_size_label | short_composition1 | short_composition2 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Augmentin 625 Duo Tablet | 223.42 | False | Glaxo SmithKline Pharmaceuticals Ltd | allopathy | strip of 10 tablets | Amoxycillin (500mg) | Clavulanic Acid (125mg) |
| 1 | 2 | Azithral 500 Tablet | 132.36 | False | Alembic Pharmaceuticals Ltd | allopathy | strip of 5 tablets | Azithromycin (500mg) | NaN |
| 2 | 3 | Ascoril LS Syrup | 118.0 | False | Glenmark Pharmaceuticals Ltd | allopathy | bottle of 100 ml Syrup | Ambroxol (30mg/5ml) | Levosalbutamol (1mg/5ml) |
| 3 | 4 | Allegra 120mg Tablet | 218.81 | False | Sanofi India Ltd | allopathy | strip of 10 tablets | Fexofenadine (120mg) | NaN |
| 4 | 5 | Avil 25 Tablet | 10.96 | False | Sanofi India Ltd | allopathy | strip of 15 tablets | Pheniramine (25mg) | NaN |


In [8]:
df.tail()

| | id | name | price(₹) | Is_discontinued | manufacturer_name | type | pack_size_label | short_composition1 | short_composition2 |
|---|---|---|---|---|---|---|---|---|---|
| 253968 | 253969 | Ziyapod 100mg Oral Suspension | 62.3 | False | Ziyana Lifesciences Pvt Ltd | allopathy | bottle of 30 ml Oral Suspension | Cefpodoxime Proxetil (100mg) | NaN |
| 253969 | 253970 | Zemhart 30mg Tablet | 54.0 | False | Leeford Healthcare Ltd | allopathy | strip of 10 tablets | Diltiazem (30mg) | NaN |
| 253970 | 253971 | Zivex 25mg Tablet | 57.0 | False | Euro Organics | allopathy | strip of 10 tablets | Hydroxyzine (25mg) | NaN |
| 253971 | 253972 | ZI Fast 500mg Injection | 152.0 | False | Burgeon Health Series Private Limited | allopathy | vial of 1 Injection | Azithromycin (500mg) | NaN |
| 253972 | 253973 | Zyvocol 1% Dusting Powder | 110.0 | False | GBK Healthcare | allopathy | bottle of 75 gm Dusting Powder | Clotrimazole (1% w/w) | NaN |


In [9]:
df.columns

Index(['id', 'name', 'price(₹)', 'Is_discontinued', 'manufacturer_name',
       'type', 'pack_size_label', 'short_composition1', 'short_composition2'],
      dtype='object')

##**5.2 Check the Dataset Shape**

In [10]:
df.shape

(253973, 9)

**Interpretation**- *The dataset has 253,973 rows and 9 columns.*

##**5.3 Dataset Summary Overview**

Check for missing values and data types of each column.

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253973 entries, 0 to 253972
Data columns (total 9 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  253973 non-null  int64  
 1   name                253973 non-null  object 
 2   price(₹)            253973 non-null  float64
 3   Is_discontinued     253973 non-null  bool   
 4   manufacturer_name   253973 non-null  object 
 5   type                253973 non-null  object 
 6   pack_size_label     253973 non-null  object 
 7   short_composition1  253973 non-null  object 
 8   short_composition2  112171 non-null  object 
dtypes: bool(1), float64(1), int64(1), object(6)
memory usage: 15.7+ MB


**Interpretation**- *The dataset contains 253,973 records and 9 columns: medicine ID, name, price, discontinuation flag, manufacturer, type, pack size, and two composition fields. Only short_composition2 has missing values: 112,171 of its entries are non-null, so 141,802 (about 55.8%) are missing. The dtypes are one int64, one float64, one bool, and six object (string) columns.*
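
The missing-value count can also be computed directly instead of being read off `df.info()`. The sketch below builds a small frame from the five rows shown by `df.head()` and tabulates missing counts and percentages per column; the name `missing_report` is an assumption. The last lines redo the arithmetic for short_composition2 from the counts reported above.

```python
import pandas as pd

# Composition columns from the five rows shown by df.head().
sample = pd.DataFrame({
    "short_composition1": [
        "Amoxycillin (500mg)", "Azithromycin (500mg)", "Ambroxol (30mg/5ml)",
        "Fexofenadine (120mg)", "Pheniramine (25mg)",
    ],
    "short_composition2": [
        "Clavulanic Acid (125mg)", None, "Levosalbutamol (1mg/5ml)", None, None,
    ],
})

# Count and percentage of missing values per column.
missing_report = pd.DataFrame({
    "missing": sample.isna().sum(),
    "missing_pct": (sample.isna().mean() * 100).round(1),
})
print(missing_report)

# Same arithmetic for the full dataset, using the counts from df.info().
total_rows, non_null = 253_973, 112_171
missing_rows = total_rows - non_null
missing_share = round(missing_rows / total_rows * 100, 1)
print(missing_rows, missing_share)
```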

##**5.4 Statistical Summary of Numerical Columns**

We generate a statistical summary of the numerical columns to understand their distribution, central tendency, and spread across the dataset.

In [13]:
df.select_dtypes(include='number').describe()

| | id | price(₹) |
|---|---|---|
| count | 253973.0 | 253973.0 |
| mean | 126987.0 | 270.530844 |
| std | 73315.834296 | 3029.584134 |
| min | 1.0 | 0.0 |
| 25% | 63494.0 | 48.0 |
| 50% | 126987.0 | 79.0 |
| 75% | 190480.0 | 140.0 |
| max | 253973.0 | 436000.0 |
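
The price summary shows a mean (270.53) far above the median (79.0) and a maximum of 436,000, which points to a strongly right-skewed distribution. One common way to quantify this is the 1.5 × IQR rule, sketched below. The helpers `iqr_fences` and `flag_price_outliers` are hypothetical names, and the rule itself is a convention rather than anything the dataset prescribes. The price list reuses the ten previewed prices plus the reported maximum.

```python
import pandas as pd


def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple:
    """Return the (lower, upper) fences of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Fences implied by the quartiles reported by describe().
lower, upper = iqr_fences(48.0, 140.0)
print(lower, upper)


def flag_price_outliers(prices: pd.Series, k: float = 1.5) -> pd.Series:
    """Mark prices that fall outside the k * IQR fences."""
    q1, q3 = prices.quantile([0.25, 0.75])
    low, high = iqr_fences(q1, q3, k)
    return (prices < low) | (prices > high)


# Prices from the previewed rows plus the reported maximum.
prices = pd.Series(
    [223.42, 132.36, 118.0, 218.81, 10.96,
     62.3, 54.0, 57.0, 152.0, 110.0, 436000.0],
    name="price(₹)",
)
flags = flag_price_outliers(prices)
print(prices[flags])
```

With the reported quartiles the upper fence is ₹278, so any price above that would count as an outlier under this rule. The minimum of 0.0 is inside the fences but may still deserve a separate check.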


##**5.5 Statistical Summary of Categorical Columns**

In [14]:
df.select_dtypes(include='object').describe()

| | name | manufacturer_name | type | pack_size_label | short_composition1 | short_composition2 |
|---|---|---|---|---|---|---|
| count | 253973 | 253973 | 253973 | 253973 | 253973 | 112171 |
| unique | 249398 | 7648 | 1 | 1929 | 8523 | 2980 |
| top | NS 0.9% Infusion | Sun Pharmaceutical Industries Ltd | allopathy | strip of 10 tablets | Aceclofenac (100mg) | Rabeprazole (20mg) |
| freq | 12 | 2986 | 253973 | 116540 | 6930 | 4743 |
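
The categorical summary reports only the single most frequent value per column. The sketch below shows one way to go a step further: computing each value's share with `value_counts(normalize=True)` and listing columns that hold a single distinct value, as type does here. The variable names are assumptions, and the small frame reuses the five rows shown by `df.head()`.

```python
import pandas as pd

# Categorical columns from the five rows shown by df.head().
sample = pd.DataFrame({
    "manufacturer_name": [
        "Glaxo SmithKline Pharmaceuticals Ltd",
        "Alembic Pharmaceuticals Ltd",
        "Glenmark Pharmaceuticals Ltd",
        "Sanofi India Ltd",
        "Sanofi India Ltd",
    ],
    "type": ["allopathy"] * 5,
})

# Share of rows per manufacturer, largest first.
manufacturer_share = sample["manufacturer_name"].value_counts(normalize=True)
print(manufacturer_share.head(3))

# Columns with one distinct value carry no information for comparisons.
distinct_counts = sample.nunique()
constant_columns = distinct_counts[distinct_counts == 1].index.tolist()
print(constant_columns)
```

Applied to the full dataset, the same share calculation would put the top manufacturer's 2,986 products at roughly 1.2% of the 253,973 rows.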
