## Data Exploration and Cleaning
#### The purpose of this notebook is to explore and clean the data in order to prepare it for the application's recommender engine. The data used here is a small sample of medical specialists in Yaounde, Cameroon. The recommender system will be built to match users (patients) to medical specialists.

In [1]:
#import statements for all relevant libraries
import pandas as pd
import numpy as np

### Data Import and Exploration

In [2]:
#import the specialist data as a dataframe
specialists = pd.read_csv("specialist.csv")

In [3]:
#Check the size of the dataframe
specialists.shape

(402, 8)

In [4]:
#Get basic details about the dataframe constituents
specialists.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 402 entries, 0 to 401
Data columns (total 8 columns):
Id             401 non-null float64
name           401 non-null object
specialty      401 non-null object
institution    401 non-null object
location       401 non-null object
days           401 non-null object
language       401 non-null object
telephone      401 non-null float64
dtypes: float64(2), object(6)
memory usage: 25.2+ KB


As shown above, the specialists dataset describes each specialist in the database: their name, specialty, institution, location, working days, language and telephone number. This can be considered their demographic data, and it will be used to build a profile for each specialist so that the right specialist can be matched to a user's needs.
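The `info()` output above reports `telephone` as `float64`, so phone numbers are handled as floating-point values rather than as identifiers. A minimal sketch of reading such a column as text instead, assuming the raw file stores plain digits; the inline CSV, the names and the numbers below are made-up sample data, not rows from `specialist.csv`.

```python
import io
import pandas as pd

# Made-up sample rows standing in for specialist.csv (assumption: the file
# stores telephone numbers as plain digits).
sample_csv = io.StringIO(
    "Id,name,telephone\n"
    "1,DR A,237611000000\n"
    "2,DR B,237699123456\n"
)

# Reading telephone as text keeps every digit and avoids float formatting.
sample = pd.read_csv(sample_csv, dtype={"telephone": str})
print(sample["telephone"].tolist())
```

Whether this is worthwhile depends on how the telephone column is used later; in this notebook it is dropped before the final export.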

In [5]:
#Get a preview of the dataframe
specialists.head()

Unnamed: 0,Id,name,specialty,institution,location,days,language,telephone
0,1.0,DR YOMBA,CARDIOLOGIST,Le Jourdain,Nlongkak,Monday|Wednesday,French,237611000000.0
1,2.0,DR BOOMBHI,CARDIOLOGIST,Le Jourdain,Nlongkak,Tuesday|Wednesday|Thursday,French,237611000000.0
2,3.0,DR GRACE OKORO,DERMATOLOGIST,Le Jourdain,Nlongkak,Monday|Tuesday|Saturday+F4:F118,French,237611000000.0
3,4.0,DR TSONGUI,DERMATOLOGIST,Le Jourdain,Nlongkak,Thursday|Friday|Saturday,French,237611000000.0
4,5.0,DR ATEBA,ENDOCRINOLOGIST,Le Jourdain,Nlongkak,Monday|Tuesday|Thursday,French,237611000000.0


The table above previews the first five rows of the specialists dataset. Ratings, which are not part of this file, range from 1 to 5, with 1 implying that the user was unsatisfied with the specialist's services and 5 implying that the specialist provides stellar services.

In [6]:
#Check for the existence of null values
specialists.isnull().values.any()

True

In [7]:
#Return the count of null values
specialists.isnull().sum()

Id             1
name           1
specialty      1
institution    1
location       1
days           1
language       1
telephone      1
dtype: int64

In [8]:
#Return the row which has the missing data
null_data = specialists[specialists.isnull().any(axis=1)]
null_data

Unnamed: 0,Id,name,specialty,institution,location,days,language,telephone
401,,,,,,,,


In [9]:
#Dropping the empty row
specialists = specialists.drop(401)
specialists.shape

(401, 8)
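Dropping by the hard-coded index `401` works for this file, but it breaks if the empty row moves. A small sketch of an alternative, using `dropna(how="all")` on a toy dataframe (the toy rows are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with one completely empty row, mimicking the situation above.
toy = pd.DataFrame(
    {
        "Id": [1.0, 2.0, np.nan],
        "name": ["DR A", "DR B", np.nan],
        "specialty": ["CARDIOLOGIST", "DENTIST", np.nan],
    }
)

# how="all" drops only rows in which every column is missing.
cleaned = toy.dropna(how="all").reset_index(drop=True)
print(cleaned.shape)
```

Rows that are only partially missing are kept with `how="all"`, so they can still be inspected and handled separately.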

In [10]:
#Return the different specialties in the dataset
specialists['specialty'].unique()

array(['CARDIOLOGIST', 'DERMATOLOGIST', 'ENDOCRINOLOGIST',
       'GASTROENTOROLOGIST', 'RESIDENT', 'INFECTOLOGIST', 'OB/GYN',
       'NEUROLOGIST', 'NEPHROLOGIST', 'NUTRITIONIST', 'ONDOSTOMATOLOGIA',
       'OPHTHALMOLOGIST', 'EAR, NOSE AND THROAT', 'ORTHOPEDIC SURGEON',
       'PEDIATRICIAN', 'PNEUMOLOGIST', 'PSYCHIATRIST', 'RHUMATOLOGIST',
       'UROLOGIST', 'KINESITHERAPIST', 'ETIOPATIEST', 'BIOLOGIST',
       'PSYCHOLOGIST', 'VISCERAL AND DIGESTIVE SURGERY', 'ORTHOPHONIST',
       'RADIOLOGIST', 'DENTIST', 'GENERAL PRACTITIONER',
       'GENERAL PPRACTIIONER', 'NURSE', 'NEUROSURGEON',
       'FAMILY PRACTITIONER', 'GASTRO ENTEROLOGUE', 'GENERAL SURGEON',
       'NEPHOLOGIST', 'ORTHOPEADIST', 'PHARMACIST', 'PULMONOLOGIST',
       'GENR', 'ORTHODENTIST'], dtype=object)

In [11]:
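#Update specialty column by making sure spellings are in sync
#Note: replace() returns a new dataframe; the result is only displayed here, not assigned back to specialists
specialists.replace({'specialty': 'GENERAL PPRACTIIONER'}, {'specialty': 'GENERAL PRACTITIONER'})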

Unnamed: 0,Id,name,specialty,institution,location,days,language,telephone
0,1.0,DR YOMBA,CARDIOLOGIST,Le Jourdain,Nlongkak,Monday|Wednesday,French,2.376110e+11
1,2.0,DR BOOMBHI,CARDIOLOGIST,Le Jourdain,Nlongkak,Tuesday|Wednesday|Thursday,French,2.376110e+11
2,3.0,DR GRACE OKORO,DERMATOLOGIST,Le Jourdain,Nlongkak,Monday|Tuesday|Saturday+F4:F118,French,2.376110e+11
3,4.0,DR TSONGUI,DERMATOLOGIST,Le Jourdain,Nlongkak,Thursday|Friday|Saturday,French,2.376110e+11
4,5.0,DR ATEBA,ENDOCRINOLOGIST,Le Jourdain,Nlongkak,Monday|Tuesday|Thursday,French,2.376110e+11
...,...,...,...,...,...,...,...,...
396,397.0,DR COLETTE,PEDIATRICIAN,Independent,Yaounde,On call,French,2.379556e+10
397,398.0,DR RUDOLPHE,OB/GYN,Independent,Yaounde,On call,French,2.377560e+12
398,399.0,DR FLORENTIA,DENTIST,Independent,Yaounde,On call,French,2.377560e+12
399,400.0,DR LAURA KAMSONG,OB/GYN,Independent,Yaounde,On call,French,2.377556e+10


In [12]:
#Same replacement for 'GASTRO ENTEROLOGUE' (result displayed only, not assigned back)
specialists.replace({'specialty': 'GASTRO ENTEROLOGUE'}, {'specialty': 'GASTROENTOROLOGIST'})

Unnamed: 0,Id,name,specialty,institution,location,days,language,telephone
0,1.0,DR YOMBA,CARDIOLOGIST,Le Jourdain,Nlongkak,Monday|Wednesday,French,2.376110e+11
1,2.0,DR BOOMBHI,CARDIOLOGIST,Le Jourdain,Nlongkak,Tuesday|Wednesday|Thursday,French,2.376110e+11
2,3.0,DR GRACE OKORO,DERMATOLOGIST,Le Jourdain,Nlongkak,Monday|Tuesday|Saturday+F4:F118,French,2.376110e+11
3,4.0,DR TSONGUI,DERMATOLOGIST,Le Jourdain,Nlongkak,Thursday|Friday|Saturday,French,2.376110e+11
4,5.0,DR ATEBA,ENDOCRINOLOGIST,Le Jourdain,Nlongkak,Monday|Tuesday|Thursday,French,2.376110e+11
...,...,...,...,...,...,...,...,...
396,397.0,DR COLETTE,PEDIATRICIAN,Independent,Yaounde,On call,French,2.379556e+10
397,398.0,DR RUDOLPHE,OB/GYN,Independent,Yaounde,On call,French,2.377560e+12
398,399.0,DR FLORENTIA,DENTIST,Independent,Yaounde,On call,French,2.377560e+12
399,400.0,DR LAURA KAMSONG,OB/GYN,Independent,Yaounde,On call,French,2.377556e+10


In [13]:
#Same replacement for 'GENR' (result displayed only, not assigned back)
specialists.replace({'specialty': 'GENR'}, {'specialty': 'GENERAL PRACTITIONER'})

Unnamed: 0,Id,name,specialty,institution,location,days,language,telephone
0,1.0,DR YOMBA,CARDIOLOGIST,Le Jourdain,Nlongkak,Monday|Wednesday,French,2.376110e+11
1,2.0,DR BOOMBHI,CARDIOLOGIST,Le Jourdain,Nlongkak,Tuesday|Wednesday|Thursday,French,2.376110e+11
2,3.0,DR GRACE OKORO,DERMATOLOGIST,Le Jourdain,Nlongkak,Monday|Tuesday|Saturday+F4:F118,French,2.376110e+11
3,4.0,DR TSONGUI,DERMATOLOGIST,Le Jourdain,Nlongkak,Thursday|Friday|Saturday,French,2.376110e+11
4,5.0,DR ATEBA,ENDOCRINOLOGIST,Le Jourdain,Nlongkak,Monday|Tuesday|Thursday,French,2.376110e+11
...,...,...,...,...,...,...,...,...
396,397.0,DR COLETTE,PEDIATRICIAN,Independent,Yaounde,On call,French,2.379556e+10
397,398.0,DR RUDOLPHE,OB/GYN,Independent,Yaounde,On call,French,2.377560e+12
398,399.0,DR FLORENTIA,DENTIST,Independent,Yaounde,On call,French,2.377560e+12
399,400.0,DR LAURA KAMSONG,OB/GYN,Independent,Yaounde,On call,French,2.377556e+10


In [14]:
#Check for any inconsistencies again
specialists['specialty'].unique()

array(['CARDIOLOGIST', 'DERMATOLOGIST', 'ENDOCRINOLOGIST',
       'GASTROENTOROLOGIST', 'RESIDENT', 'INFECTOLOGIST', 'OB/GYN',
       'NEUROLOGIST', 'NEPHROLOGIST', 'NUTRITIONIST', 'ONDOSTOMATOLOGIA',
       'OPHTHALMOLOGIST', 'EAR, NOSE AND THROAT', 'ORTHOPEDIC SURGEON',
       'PEDIATRICIAN', 'PNEUMOLOGIST', 'PSYCHIATRIST', 'RHUMATOLOGIST',
       'UROLOGIST', 'KINESITHERAPIST', 'ETIOPATIEST', 'BIOLOGIST',
       'PSYCHOLOGIST', 'VISCERAL AND DIGESTIVE SURGERY', 'ORTHOPHONIST',
       'RADIOLOGIST', 'DENTIST', 'GENERAL PRACTITIONER',
       'GENERAL PPRACTIIONER', 'NURSE', 'NEUROSURGEON',
       'FAMILY PRACTITIONER', 'GASTRO ENTEROLOGUE', 'GENERAL SURGEON',
       'NEPHOLOGIST', 'ORTHOPEADIST', 'PHARMACIST', 'PULMONOLOGIST',
       'GENR', 'ORTHODENTIST'], dtype=object)
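The list above still contains 'GENERAL PPRACTIIONER', 'GASTRO ENTEROLOGUE' and 'GENR' because `replace` returns a new object rather than modifying `specialists` in place. A minimal sketch of making the change stick, shown on a toy dataframe; the `corrections` mapping is an illustrative name and only covers the three spellings targeted above.

```python
import pandas as pd

toy = pd.DataFrame(
    {"specialty": ["GENR", "GENERAL PPRACTIIONER", "GASTRO ENTEROLOGUE", "DENTIST"]}
)

# Mapping from the variant spellings targeted above to the chosen labels.
corrections = {
    "GENERAL PPRACTIIONER": "GENERAL PRACTITIONER",
    "GASTRO ENTEROLOGUE": "GASTROENTOROLOGIST",
    "GENR": "GENERAL PRACTITIONER",
}

# Series.replace returns a new Series, so assign it back to keep the change.
toy["specialty"] = toy["specialty"].replace(corrections)
print(toy["specialty"].unique())
```

Applied to the real data, the same assignment would lower the number of unique specialties, so the counts reported in the cells that follow would change accordingly.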

In [15]:
specialists.shape

(401, 8)

In [16]:
#Return the number of unique specialties
specialists['specialty'].nunique()

40

In [17]:
specialists.describe(include = ['object', 'float'])

Unnamed: 0,Id,name,specialty,institution,location,days,language,telephone
count,401.0,401,401,401,401,401,401,401.0
unique,,392,40,22,16,48,2,
top,,Consultation,OB/GYN,HGOPY,Ngousso,Monday| Tuesday| Wednesday| Thursday|Friday,French,
freq,,3,49,68,68,53,367,
mean,201.0,,,,,,,540125700000.0
std,115.902977,,,,,,,852316600000.0
min,1.0,,,,,,,69942020.0
25%,101.0,,,,,,,23795550000.0
50%,201.0,,,,,,,237611000000.0
75%,301.0,,,,,,,237956000000.0


In [18]:
specialists['specialty'].describe()

count        401
unique        40
top       OB/GYN
freq          49
Name: specialty, dtype: object

In [19]:
specialists['institution'].describe()

count       401
unique       22
top       HGOPY
freq         68
Name: institution, dtype: object

In [20]:
specialists['language'].describe()

count        401
unique         2
top       French
freq         367
Name: language, dtype: object

In [21]:
specialists['location'].describe()

count         401
unique         16
top       Ngousso
freq           68
Name: location, dtype: object

In [22]:
specialists['days'].describe()

count                                             401
unique                                             48
top       Monday| Tuesday| Wednesday| Thursday|Friday
freq                                               53
Name: days, dtype: object

In [23]:
#Export the clean data to a csv file for future use
#A forward slash avoids the invalid '\s' escape sequence and works across operating systems
specialists.to_csv('./specialistLens.csv')

The ratings data will be explored to discover any patterns or anything else of interest in the data.

#### In this step, the data will be restructured to make it easier to use when creating the recommender system. The end goal is to have a dataframe with 3 main parameters: the id, the title (the specialist's name) and the attributes, which combine the specialty, institution, location, days and language. The phone number is not included in the attributes built below.

In [24]:
#Now the specialty, institution, location, days and language columns are merged into 1 column called attributes
specialists['attributes'] = specialists['specialty'] + '|' + specialists['institution'] + '|' + specialists['location'] + '|' + specialists['days'] + '|' + specialists['language']
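The concatenation above can also be written against a list of column names, which keeps the set of merged columns in one place. A sketch under that assumption, using a single toy row that mirrors the first record in the preview; `attribute_columns` is an illustrative name.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "specialty": ["CARDIOLOGIST"],
        "institution": ["Le Jourdain"],
        "location": ["Nlongkak"],
        "days": ["Monday|Wednesday"],
        "language": ["French"],
    }
)

attribute_columns = ["specialty", "institution", "location", "days", "language"]

# Join the selected columns row by row with the same '|' separator.
toy["attributes"] = toy[attribute_columns].apply("|".join, axis=1)
print(toy.loc[0, "attributes"])
```

Note that `days` already uses `|` between weekdays, so the individual days become separate tokens in the merged string, just as in the original cell.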

The other columns are dropped so that the resulting dataframe has just 3 columns: the specialist's Id, their name and the attributes, which are considered to be their demographic data.

In [25]:
specialist_clean = specialists.drop(['specialty', 'institution', 'location', 'days', 'language', 'telephone'], axis=1)
specialist_clean.head()

Unnamed: 0,Id,name,attributes
0,1.0,DR YOMBA,CARDIOLOGIST|Le Jourdain|Nlongkak|Monday|Wedne...
1,2.0,DR BOOMBHI,CARDIOLOGIST|Le Jourdain|Nlongkak|Tuesday|Wedn...
2,3.0,DR GRACE OKORO,DERMATOLOGIST|Le Jourdain|Nlongkak|Monday|Tues...
3,4.0,DR TSONGUI,DERMATOLOGIST|Le Jourdain|Nlongkak|Thursday|Fr...
4,5.0,DR ATEBA,ENDOCRINOLOGIST|Le Jourdain|Nlongkak|Monday|Tu...


In [26]:
#Dataframe is exported to csv file
#A forward slash avoids the invalid '\s' escape sequence and works across operating systems
specialist_clean.to_csv('./specialistClean.csv')