RNN ,regression trees, k-nearest neighbors, support vector machines

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.model_selection import train_test_split
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [3]:
data = pd.read_excel("C:/Users/Sadyo/Desktop/Doctor_fee_consultation/Final Participant Data Folder/Final_Train.xlsx")

In [4]:
data.head()

Unnamed: 0,Qualification,Experience,Rating,Place,Profile,Miscellaneous_Info,Fees
0,"BHMS, MD - Homeopathy",24 years experience,100%,"Kakkanad, Ernakulam",Homeopath,"100% 16 Feedback Kakkanad, Ernakulam",100
1,"BAMS, MD - Ayurveda Medicine",12 years experience,98%,"Whitefield, Bangalore",Ayurveda,"98% 76 Feedback Whitefield, Bangalore",350
2,"MBBS, MS - Otorhinolaryngology",9 years experience,,"Mathikere - BEL, Bangalore",ENT Specialist,,300
3,"BSc - Zoology, BAMS",12 years experience,,"Bannerghatta Road, Bangalore",Ayurveda,"Bannerghatta Road, Bangalore ₹250 Available on...",250
4,BAMS,20 years experience,100%,"Keelkattalai, Chennai",Ayurveda,"100% 4 Feedback Keelkattalai, Chennai",250


It can be observed that to do mathematical work on this data it needs lot of Preprocessing.First I would try to extract meaningful information from data.

In [5]:
data.isnull().sum()

Qualification            0
Experience               0
Rating                3302
Place                   25
Profile                  0
Miscellaneous_Info    2620
Fees                     0
dtype: int64

In [6]:
data.shape

(5961, 7)

Lot of work has to be done on Rating, Place, Miscellaneous_Info columns. Try to not to drop any column as we already have less number of predictors.

First, converting Rating into float or integer type :

In [7]:
# following code would give the ratings in float data type:
## making calculations on that would be easy and effective as well:

data['Rating'] = data['Rating'].str.rstrip('%').astype('float') / 100.0

In [8]:
data['Rating']

0       1.00
1       0.98
2        NaN
3        NaN
4       1.00
        ... 
5956    0.98
5957     NaN
5958    0.97
5959    0.90
5960    1.00
Name: Rating, Length: 5961, dtype: float64

In [9]:
# converting the Experince column in integers:

data["Experience"] = data['Experience'].str.replace(r'\D', '') # it will remove the text written with number of years

In [10]:
data["Experience"]

0       24
1       12
2        9
3       12
4       20
        ..
5956    19
5957    33
5958    41
5959    15
5960    17
Name: Experience, Length: 5961, dtype: object

In [11]:
# checking the type of data of Experience:
data["Experience"].dtype

dtype('O')

In [12]:
# converting datatype of Experience into integer:

data["Experience"] = data["Experience"].astype(str).astype(float) # as Object cannot be converted to integer data type.

In [13]:
# as there is no Null values in Experience column, so it can be converted to integer wright now:

data["Experience"] = data["Experience"].astype(int)

In [14]:
# checking the type of data of Experience:

data["Experience"].dtype

dtype('int32')

It has been successfully changed.

Now , treating Place column, it is difficult to deal with the present format of Place column :
    Trying splitting it into two columns :

In [15]:
# Following code will split the values of Place column into two seperate columns:

data['Place'], data['City'] = data['Place'].str.rsplit(', ', 1).str

In [16]:
data.head()

Unnamed: 0,Qualification,Experience,Rating,Place,Profile,Miscellaneous_Info,Fees,City
0,"BHMS, MD - Homeopathy",24,1.0,Kakkanad,Homeopath,"100% 16 Feedback Kakkanad, Ernakulam",100,Ernakulam
1,"BAMS, MD - Ayurveda Medicine",12,0.98,Whitefield,Ayurveda,"98% 76 Feedback Whitefield, Bangalore",350,Bangalore
2,"MBBS, MS - Otorhinolaryngology",9,,Mathikere - BEL,ENT Specialist,,300,Bangalore
3,"BSc - Zoology, BAMS",12,,Bannerghatta Road,Ayurveda,"Bannerghatta Road, Bangalore ₹250 Available on...",250,Bangalore
4,BAMS,20,1.0,Keelkattalai,Ayurveda,"100% 4 Feedback Keelkattalai, Chennai",250,Chennai


Now, trying treating Miscellaneous_Info:
    I find only number of feedback in this column that could be little relevant for Model Building.

In [17]:
# trying to seperate number of feedbacks from Miscellaneous_Info:

data['Miscellaneous_Info'] = data.Miscellaneous_Info.str.extract('% (.+) Feedback')

In [18]:
data.head()

Unnamed: 0,Qualification,Experience,Rating,Place,Profile,Miscellaneous_Info,Fees,City
0,"BHMS, MD - Homeopathy",24,1.0,Kakkanad,Homeopath,16.0,100,Ernakulam
1,"BAMS, MD - Ayurveda Medicine",12,0.98,Whitefield,Ayurveda,76.0,350,Bangalore
2,"MBBS, MS - Otorhinolaryngology",9,,Mathikere - BEL,ENT Specialist,,300,Bangalore
3,"BSc - Zoology, BAMS",12,,Bannerghatta Road,Ayurveda,,250,Bangalore
4,BAMS,20,1.0,Keelkattalai,Ayurveda,4.0,250,Chennai


In [19]:
doctor_data = data.rename(columns = {'Experience': 'Experience(in years)', 'Miscellaneous_Info': 'No. of Feedback'}, inplace = False)

In [20]:
doctor_data.head()

Unnamed: 0,Qualification,Experience(in years),Rating,Place,Profile,No. of Feedback,Fees,City
0,"BHMS, MD - Homeopathy",24,1.0,Kakkanad,Homeopath,16.0,100,Ernakulam
1,"BAMS, MD - Ayurveda Medicine",12,0.98,Whitefield,Ayurveda,76.0,350,Bangalore
2,"MBBS, MS - Otorhinolaryngology",9,,Mathikere - BEL,ENT Specialist,,300,Bangalore
3,"BSc - Zoology, BAMS",12,,Bannerghatta Road,Ayurveda,,250,Bangalore
4,BAMS,20,1.0,Keelkattalai,Ayurveda,4.0,250,Chennai


Now our data is looking more meaningful and representable.

In [21]:
# Rearranging the columns :

doctor_data = doctor_data[['Profile', 'Qualification', 'Experience(in years)', 'Rating', 'Place','City','No. of Feedback','Fees']]

In [22]:
doctor_data.head()

Unnamed: 0,Profile,Qualification,Experience(in years),Rating,Place,City,No. of Feedback,Fees
0,Homeopath,"BHMS, MD - Homeopathy",24,1.0,Kakkanad,Ernakulam,16.0,100
1,Ayurveda,"BAMS, MD - Ayurveda Medicine",12,0.98,Whitefield,Bangalore,76.0,350
2,ENT Specialist,"MBBS, MS - Otorhinolaryngology",9,,Mathikere - BEL,Bangalore,,300
3,Ayurveda,"BSc - Zoology, BAMS",12,,Bannerghatta Road,Bangalore,,250
4,Ayurveda,BAMS,20,1.0,Keelkattalai,Chennai,4.0,250


In [23]:
doctor_data['Rating'] = doctor_data['Rating']*100

In [24]:
doctor_data = doctor_data.rename(columns = {'Rating': 'Rating(in percent)'}, inplace = False)

In [25]:
doctor_data.head()

Unnamed: 0,Profile,Qualification,Experience(in years),Rating(in percent),Place,City,No. of Feedback,Fees
0,Homeopath,"BHMS, MD - Homeopathy",24,100.0,Kakkanad,Ernakulam,16.0,100
1,Ayurveda,"BAMS, MD - Ayurveda Medicine",12,98.0,Whitefield,Bangalore,76.0,350
2,ENT Specialist,"MBBS, MS - Otorhinolaryngology",9,,Mathikere - BEL,Bangalore,,300
3,Ayurveda,"BSc - Zoology, BAMS",12,,Bannerghatta Road,Bangalore,,250
4,Ayurveda,BAMS,20,100.0,Keelkattalai,Chennai,4.0,250


In [26]:
# Grouping/ Sorting data by Profile:

doctor_data_sorted = doctor_data.sort_values(by="Profile",ascending=True)

In [27]:
doctor_data_sorted.head(15)

Unnamed: 0,Profile,Qualification,Experience(in years),Rating(in percent),Place,City,No. of Feedback,Fees
1757,Ayurveda,"MD - Ayurveda Medicine, BAMS",9,,Balapur,Hyderabad,,200
563,Ayurveda,"BAMS, MD - Ayurveda Medicine",43,,Andheri East,Mumbai,,100
4648,Ayurveda,BAMS,10,,Jubilee Hills,Hyderabad,,100
3688,Ayurveda,BAMS,10,90.0,Borivali,Mumbai,,100
1449,Ayurveda,BAMS,9,,Rajouri Garden,Delhi,,500
567,Ayurveda,BAMS,11,,Mayur Vihar Ph-I,Delhi,,500
4635,Ayurveda,BAMS,2,,T Nagar,Chennai,,300
4629,Ayurveda,BAMS,8,,Ghatkopar East,Mumbai,,200
2511,Ayurveda,"BAMS, Fellowship in Cardiac Rehabilitation",9,88.0,Mira Bhayandar,Mumbai,3.0,150
3012,Ayurveda,"BAMS, MS - Ayurveda",8,,Malleswaram,Bangalore,,250


In [28]:
doctor_data_sorted.isnull().sum()

Profile                    0
Qualification              0
Experience(in years)       0
Rating(in percent)      3302
Place                     25
City                      26
No. of Feedback         4461
Fees                       0
dtype: int64

In [29]:
doctor_data_sorted.shape

(5961, 8)

Now, if we see that earlier Miscellaneous_Info had 2620 NaN values , which now has been increased to 4461 almost 75% of the size of the data.
So, it would not provide us any significant information. Hence better to leave it.

In [30]:
# Dropping No. of Feedback from the dataset:

doctor_data_sorted.drop("No. of Feedback", axis = 1, inplace = True)

Now, we have left with three columns Rating(in percent),Place and City having NULL values.

TREATING QUALIFICATION COLUMN:
    

In [31]:
doctor_data_sorted['Qualification'].dtype

dtype('O')

In [32]:
doctor_data_sorted.shape

(5961, 7)

In [33]:
doctor_data_sorted["Qualification"].unique().tolist()

['MD - Ayurveda Medicine, BAMS',
 'BAMS, MD - Ayurveda Medicine',
 'BAMS',
 'BAMS, Fellowship in Cardiac Rehabilitation',
 'BAMS, MS - Ayurveda',
 'BAMS, MD - Ayurveda Medicine, Fellowship in Medical Cosmetology (FMC)',
 'BAMS, M. D. IN KAYACHIKISTA',
 'MD - Ayurveda Medicine, BAMS, PG Diploma in Sexual Medicine',
 'BAMS, Certificate in Child Health (CCH), CGO',
 'BAMS, D.Y.A, Post Graduate Diploma in Emergency Services (PGDEMS), MD - Ayurveda Medicine',
 'BAMS, Post Graduate Diploma in Holistic Healthcare',
 'BAMS, MS - Psychology, PhD - Psychology, MSc - Yoga',
 'MD - Social & Preventive Medicine / Community Medicine, BAMS',
 'MD - General Medicine, BAMS, Diploma In Naturopathy (ND)',
 'BAMS, D.Ac, Post Graduate Diploma In Yoga',
 'MD - Ayurveda Medicine, MBA, Post Graduate Diploma in Clinical Research (PGDCR)',
 'BAMS, Diploma in Emergency Medicine, Diploma in Counselling Skills',
 'BAMS, DSM ( Siddha Medicine), Diploma in Emergency Medicine',
 'BAMS, Diploma in Health Administratio

In [34]:
doctor_data_sorted["Qualification"].nunique()

1420

In [35]:
doctor_data_sorted["Qualification"] = str(doctor_data_sorted["Qualification"])


In [32]:
# Extracting relevant qualification

doctor_data_sorted["Qualification"] = doctor_data_sorted["Qualification"].str.split("," ,expand = True,)

Qualification = {}
for x in doctor_data_sorted["Qualification"].values:
    for each in x:
        each = each.strip()
        if each in Qualification:
            
            Qualification[each]+1
        else:
                Qualification[each] = 1

Here, we can see that Qualification column has 1420 unique values that is not possible to code, so I take here most occuring 20 Qualifications like MBBS, MD,BHMS,DM,DNB etc.

In [33]:
most_occ = sorted(doctor_data_sorted['Qualification'].items(), key = lambda x:x[1],reverse=True)[:20]

final_qual = []

for tup in most_occ:
    final_qual.append(tup[0])
    
for x,y in zip(doctor_data_sorted["Qualification"].values, np.array([idx for idx in range(len(doctor_data_sorted))])):
    for q in x:
        q = q.strip()
        if q in final_qual:
            doctor_data_sorted[q][y] = 1
            
            


In [38]:
pd.set_option("display.max_rows", None)

In [39]:
doctor_data_sorted

Unnamed: 0,Profile,Qualification,Experience(in years),Rating(in percent),Place,City,Fees
1757,Ayurveda,MD - Ayurveda Medicine,9,,Balapur,Hyderabad,200
563,Ayurveda,BAMS,43,,Andheri East,Mumbai,100
4648,Ayurveda,BAMS,10,,Jubilee Hills,Hyderabad,100
3688,Ayurveda,BAMS,10,90.0,Borivali,Mumbai,100
1449,Ayurveda,BAMS,9,,Rajouri Garden,Delhi,500
567,Ayurveda,BAMS,11,,Mayur Vihar Ph-I,Delhi,500
4635,Ayurveda,BAMS,2,,T Nagar,Chennai,300
4629,Ayurveda,BAMS,8,,Ghatkopar East,Mumbai,200
2511,Ayurveda,BAMS,9,88.0,Mira Bhayandar,Mumbai,150
3012,Ayurveda,BAMS,8,,Malleswaram,Bangalore,250


In [None]:
doctor_data_sorted.drop('Qualification', axis = 1, inplace = True)    


In [46]:
pd.set_option("display.max_rows", None)

In [None]:
doctor_data_sorted

In [None]:
# removing missing values from place and city with their modes

In [40]:
doctor_data_sorted['Place'].value_counts()

HSR Layout                  75
Andheri West                70
Dwarka                      67
Banjara Hills               64
Mulund West                 54
Borivali West               52
Kandivali West              50
Indiranagar                 48
Malad West                  47
Malleswaram                 47
Whitefield                  47
Pitampura                   45
Vileparle West              43
Andheri East                43
Powai                       42
Jubilee Hills               40
Bannerghatta Road           39
Marathahalli                39
Bandra West                 38
Kondapur                    38
Ghatkopar East              38
Secunderabad                37
Rohini                      37
Janak Puri                  37
Adyar                       37
Shalimar Bagh               36
Paschim Vihar               36
KPHB                        35
Koramangala                 35
R.S. Puram                  34
Kukatpally                  34
Old Rajendra Nagar          34
Madhapur

In [None]:
in place column HSR L

In [41]:
doctor_data_sorted['City'].value_counts()

Bangalore             1258
Mumbai                1219
Delhi                 1185
Hyderabad              951
Chennai                855
Coimbatore             228
Ernakulam              153
Thiruvananthapuram      86
Name: City, dtype: int64

In [42]:
doctor_data_sorted['City'].isnull().sum()

26

In [44]:
#AS Banglore is the mode of this column so replace NULL values by Bangalore

doctor_data_sorted['City'] = doctor_data_sorted.City.fillna('Bangalore')

In [45]:
doctor_data_sorted['City'].isnull().sum()

0

In [46]:
doctor_data_sorted['City'].value_counts()

Bangalore             1284
Mumbai                1219
Delhi                 1185
Hyderabad              951
Chennai                855
Coimbatore             228
Ernakulam              153
Thiruvananthapuram      86
Name: City, dtype: int64