## Data Mining and Preprocessing
#### The purpose of this notebook is to explore and preprocess the data in order to prepare it for the application's recommender engine. The data used here is a small sample of medical specialists in Yaounde, Cameroon. The recommender system will be built to match users (patients) to medical specialists.

In [1]:
#import statements for all relevant libraries
import pandas as pd
import numpy as np

### Data Import and Exploration

In [2]:
#import the specialist data as a dataframe
specialists = pd.read_csv("specialist.csv")

In [3]:
#Check the size of the data
specialists.shape

(191, 8)

In [4]:
#Get basic information about the data: column names, non-null counts and dtypes
specialists.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 191 entries, 0 to 190
Data columns (total 8 columns):
Id                       191 non-null int64
Specialist's name        190 non-null object
Specialist's category    191 non-null object
Institution              191 non-null object
Location                 191 non-null object
Timetable                191 non-null object
Language                 191 non-null object
Phone number             189 non-null float64
dtypes: float64(1), int64(1), object(6)
memory usage: 12.1+ KB


In [5]:
#Get a preview of the data structure
specialists.head()

Unnamed: 0,Id,Specialist's name,Specialist's category,Institution,Location,Timetable,Language,Phone number
0,1,DR YOMBA,CARDIOLOGIST,Le Jourdain,Nlongkak,Monday|Wednesday,French,237611100000.0
1,2,DR BOOMBHI,CARDIOLOGIST,Le Jourdain,Nlongkak,Tuesday|Wednesday|Thursday,French,237611100000.0
2,3,DR GRACE OKORO,DERMATOLOGIST,Le Jourdain,Nlongkak,Monday|Tuesday|Saturday,French,237611100000.0
3,4,DR TSONGUI,DERMATOLOGIST,Le Jourdain,Nlongkak,Thursday|Friday|Saturday,French,237611100000.0
4,5,DR ATEBA,ENDOCRINOLOGIST,Le Jourdain,Nlongkak,Monday|Tuesday|Thursday,French,237611100000.0


In [6]:
#Check for the existence of null values
specialists.isnull().values.any()

True

In [7]:
#Return the count of null values
specialists.isnull().sum()

Id                       0
Specialist's name        1
Specialist's category    0
Institution              0
Location                 0
Timetable                0
Language                 0
Phone number             2
dtype: int64

In [8]:
#One specialist's name is missing. The name is the primary attribute of the data
#and is essential for the recommender system,
#so we list every row that contains a null value to identify the row with the missing name before dropping it
null_data = specialists[specialists.isnull().any(axis=1)]
null_data

Unnamed: 0,Id,Specialist's name,Specialist's category,Institution,Location,Timetable,Language,Phone number
52,53,,RADIOLOGIST,Le Jourdain,Nlongkak,Monday|Tuesday|Wednesday|Thursday|Friday|Saturday,French,237611100000.0
144,145,DR NOUBEG M. L.,RADIOLOGIST,Clinique Fouda,Quatier Fouda,On call,French,
152,153,DR NOA NOA TINA,OPHTHALMOLOGIST,Clinique Fouda,Quatier Fouda,Monday|Wednesday|Thursday,French,


In [11]:
#Now the row with index label 52 (Id 53) will be dropped since it lacks a name. drop() takes the index label, not the Id value.
#The rows which lack phone numbers will be kept because they have all other relevant information
specialists = specialists.drop(52)
specialists.shape

(190, 8)

In [15]:
#The columns will be renamed with shorter and easier names
specialists = specialists.rename(columns={"Specialist's name": "name", "Specialist's category": "specialty", "Timetable": "days"})
specialists.head()

Unnamed: 0,Id,name,specialty,Institution,Location,days,Language,Phone number
0,1,DR YOMBA,CARDIOLOGIST,Le Jourdain,Nlongkak,Monday|Wednesday,French,237611100000.0
1,2,DR BOOMBHI,CARDIOLOGIST,Le Jourdain,Nlongkak,Tuesday|Wednesday|Thursday,French,237611100000.0
2,3,DR GRACE OKORO,DERMATOLOGIST,Le Jourdain,Nlongkak,Monday|Tuesday|Saturday,French,237611100000.0
3,4,DR TSONGUI,DERMATOLOGIST,Le Jourdain,Nlongkak,Thursday|Friday|Saturday,French,237611100000.0
4,5,DR ATEBA,ENDOCRINOLOGIST,Le Jourdain,Nlongkak,Monday|Tuesday|Thursday,French,237611100000.0


## Data Preprocessing
#### In this step, the data will be restructured to make it easier to use when creating the recommender system. The end goal is a dataframe with 3 main fields: the id, the title, and the attributes, which combine the specialty, institution, location, days, language and phone number.

In [None]:
#Now the last 6 columns are going to be merged into 1 column called attributes
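
The cell above is left empty in the notebook, so the following is only a minimal sketch of one way the merge could be done, not the author's implementation. It uses a hypothetical two-row `sample` dataframe that mirrors the renamed columns, a hypothetical helper `merge_attributes`, and an assumed `", "` separator (chosen so it does not clash with the `|` already used inside `days`). Missing phone numbers are skipped rather than written as `nan`.

```python
import pandas as pd

# Hypothetical two-row sample mirroring the renamed columns (not the full dataset)
sample = pd.DataFrame({
    "Id": [1, 145],
    "name": ["DR YOMBA", "DR NOUBEG M. L."],
    "specialty": ["CARDIOLOGIST", "RADIOLOGIST"],
    "Institution": ["Le Jourdain", "Clinique Fouda"],
    "Location": ["Nlongkak", "Quatier Fouda"],
    "days": ["Monday|Wednesday", "On call"],
    "Language": ["French", "French"],
    "Phone number": [237611100000.0, None],
})

# The last 6 columns, which are to be merged into a single "attributes" column
attribute_cols = ["specialty", "Institution", "Location", "days", "Language", "Phone number"]


def merge_attributes(df, cols, sep=", "):
    """Return Id, name and one 'attributes' string per row (hypothetical helper)."""

    def join_row(row):
        parts = []
        for value in row:
            # Skip missing values such as an absent phone number
            if pd.isna(value):
                continue
            # Phone numbers were read as floats; drop the trailing ".0"
            if isinstance(value, float) and value.is_integer():
                value = int(value)
            parts.append(str(value))
        return sep.join(parts)

    out = df[["Id", "name"]].copy()
    out["attributes"] = df[cols].apply(join_row, axis=1)
    return out


merged = merge_attributes(sample, attribute_cols)
print(merged)
```

On the real data the same call would be `merge_attributes(specialists, attribute_cols)`, assuming the columns have already been renamed as above.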
