**Problem Statement:**

The pharmaceutical industry faces the challenge of managing vast amounts of data related to the composition, uses, and side effects of medicines. As the number of available medications continues to grow, healthcare providers need reliable insights to prescribe the most effective treatments while minimizing adverse effects. Additionally, understanding the distribution of medicine usage and patient satisfaction can help pharmaceutical companies improve their offerings.

The task is to analyze a dataset containing detailed information about more than 11,000 medicines, including their salt compositions, uses, side effects, manufacturers, and user reviews. The goal is to uncover patterns and insights that can improve decision-making in the healthcare industry and enhance patient outcomes.

**Dataset Access:**

• The dataset Medicine_Details.csv can be downloaded from Kaggle.


# Part 3: Predict user satisfaction ratings with high accuracy using the given features.



In [1]:
import pandas as pd

# Load dataset
df = pd.read_csv('/content/Medicine_Details.csv')

# Display the first few rows
df.head()

# Check for missing values
print(df.isnull().sum())


Medicine Name         0
Composition           0
Uses                  0
Side_effects          0
Image URL             0
Manufacturer          0
Excellent Review %    0
Average Review %      0
Poor Review %         0
dtype: int64


In [3]:
# Drop rows with missing values in the target variable (Excellent Review %)
df = df.dropna(subset=['Excellent Review %'])

# Fill or drop missing values in other columns as necessary
df = df.fillna('')  # For simplicity, fill missing values with an empty string
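
Filling every column with an empty string would also touch numeric columns if they ever contained gaps. The following is a minimal sketch of a column-wise alternative. It uses a small made-up frame named `demo`; the rows are hypothetical and not taken from Medicine_Details.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only; not taken from Medicine_Details.csv
demo = pd.DataFrame({
    'Uses': ['Pain relief', None, 'Fever'],
    'Side_effects': ['Nausea', 'Headache', None],
    'Excellent Review %': [40.0, np.nan, 55.0],
})

# Drop rows whose target is missing, then renumber the index
demo = demo.dropna(subset=['Excellent Review %']).reset_index(drop=True)

# Fill gaps only in the text columns so numeric columns keep their dtype
text_cols = ['Uses', 'Side_effects']
demo[text_cols] = demo[text_cols].fillna('')

print(demo.dtypes)
```

Resetting the index after `dropna` keeps row positions contiguous, which matters when other frames are later concatenated column-wise.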


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize TF-IDF vectorizers
tfidf_uses = TfidfVectorizer()
tfidf_side_effects = TfidfVectorizer()

# Vectorize text features
X_uses = tfidf_uses.fit_transform(df['Uses'])
X_side_effects = tfidf_side_effects.fit_transform(df['Side_effects'])

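# Convert to DataFrames that share df's index, so the column-wise concat aligns rows;
# prefix the column names so tokens shared by both vocabularies stay distinct
X_uses_df = pd.DataFrame(
    X_uses.toarray(), index=df.index,
    columns=[f'uses_{t}' for t in tfidf_uses.get_feature_names_out()])
X_side_effects_df = pd.DataFrame(
    X_side_effects.toarray(), index=df.index,
    columns=[f'side_effects_{t}' for t in tfidf_side_effects.get_feature_names_out()])

# Combine TF-IDF features with the rest of the features
df_encoded = pd.concat([
    df.drop(columns=['Uses', 'Side_effects', 'Medicine Name', 'Image URL']),
    X_uses_df,
    X_side_effects_df
], axis=1)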

# One-hot encode categorical columns
df_encoded = pd.get_dummies(df_encoded, columns=['Manufacturer', 'Composition'])

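# Define features (X) and target (y).
# Caution: X keeps 'Average Review %' and 'Poor Review %'; if the three review
# percentages sum to 100 per medicine, the target follows directly from them.
X = df_encoded.drop(columns=['Excellent Review %'])
y = df_encoded['Excellent Review %']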
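
The cell above fits the TF-IDF vocabularies on the full dataset before any train/test split. One possible alternative, sketched below, wraps the same kinds of transformations in a scikit-learn `Pipeline` so that they are learned from the training rows only. The frame `demo` and its rows are hypothetical stand-ins, and the feature list deliberately leaves out the other two review-percentage columns; whether that choice suits the real analysis is an assumption.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows for illustration only; not taken from Medicine_Details.csv
demo = pd.DataFrame({
    'Uses': ['pain relief', 'fever', 'bacterial infection', 'pain and fever',
             'allergy', 'cough', 'fever', 'skin infection'],
    'Side_effects': ['nausea', 'headache', 'rash nausea', 'dizziness',
                     'drowsiness', 'nausea', 'rash', 'itching'],
    'Manufacturer': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
    'Composition': ['x', 'y', 'z', 'x', 'w', 'y', 'y', 'z'],
    'Excellent Review %': [40, 55, 30, 62, 48, 35, 50, 44],
})

feature_cols = ['Uses', 'Side_effects', 'Manufacturer', 'Composition']
X_demo = demo[feature_cols]
y_demo = demo['Excellent Review %']

preprocess = ColumnTransformer([
    # A single column name (not a list) hands TfidfVectorizer a 1-D series of strings
    ('uses', TfidfVectorizer(), 'Uses'),
    ('side_effects', TfidfVectorizer(), 'Side_effects'),
    # Categories seen only at prediction time are encoded as all zeros
    ('categories', OneHotEncoder(handle_unknown='ignore'),
     ['Manufacturer', 'Composition']),
])

pipe = Pipeline([('preprocess', preprocess), ('model', LinearRegression())])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)
pipe.fit(X_tr, y_tr)  # vocabularies and categories are learned from X_tr only
demo_pred = pipe.predict(X_te)
print(demo_pred.shape)
```

The transformer outputs stay sparse where possible, which avoids materializing a large dense matrix with `.toarray()`.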


In [5]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')


Mean Squared Error: 5.2870674126778304e-09
R-squared: 0.9999999999918778
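
An R-squared this close to 1 usually signals target leakage rather than genuine predictive power. The feature matrix still includes 'Average Review %' and 'Poor Review %'; if the three review percentages sum to 100 for each medicine (an assumption that should be checked on the real data), a linear model can recover the target exactly. The sketch below reproduces the effect on synthetic numbers only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic percentages that sum to 100 per row (an assumption about the real columns)
shares = rng.dirichlet([2.0, 2.0, 2.0], size=200) * 100
excellent, average, poor = shares[:, 0], shares[:, 1], shares[:, 2]

# Using the other two percentages as features reveals the target exactly:
# excellent = 100 - average - poor
X_leaky = np.column_stack([average, poor])
leaky_model = LinearRegression().fit(X_leaky, excellent)

leaky_r2 = r2_score(excellent, leaky_model.predict(X_leaky))
print(leaky_model.coef_, leaky_model.intercept_, leaky_r2)
```

On the actual dataset, inspecting `df[['Excellent Review %', 'Average Review %', 'Poor Review %']].sum(axis=1)` would show whether the assumption holds. If it does, dropping the two leaked columns from the features gives a more honest estimate of how well satisfaction can be predicted.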


In [6]:
from sklearn.ensemble import GradientBoostingRegressor

# Initialize and train the Gradient Boosting Regressor model
gb_model = GradientBoostingRegressor()
gb_model.fit(X_train, y_train)

# Predict on the test set
y_pred_gb = gb_model.predict(X_test)

# Evaluate the model
mse_gb = mean_squared_error(y_test, y_pred_gb)
r2_gb = r2_score(y_test, y_pred_gb)

print(f'Gradient Boosting MSE: {mse_gb}')
print(f'Gradient Boosting R-squared: {r2_gb}')


Gradient Boosting MSE: 1.5234643891130433
Gradient Boosting R-squared: 0.9976596150004813
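
A single 80/20 split yields one estimate of model quality. A hedged sketch of k-fold cross-validation, which reports one score per fold, is shown below. It runs on synthetic data from `make_regression` rather than the notebook's `X` and `y`, and the fold count and seeds are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the notebook's X and y
X_syn, y_syn = make_regression(
    n_samples=200, n_features=10, noise=5.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
gb = GradientBoostingRegressor(random_state=42)  # fixed seed for repeatable runs

# One R-squared value per fold; the spread indicates how stable the estimate is
cv_scores = cross_val_score(gb, X_syn, y_syn, cv=cv, scoring='r2')
print(cv_scores.mean(), cv_scores.std())
```

Applied to the real features, the same call would take `X` and `y` in place of `X_syn` and `y_syn`.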
