## Data Mining and Preprocessing
#### The purpose of this notebook is to explore and preprocess the data in order to prepare it for the application's recommender engine. The data used here is a small sample of medical specialists in Yaounde, Cameroon. The recommender system will be built to match users (patients) to medical specialists.

In [1]:
#import statements for all relevant libraries
import pandas as pd
import numpy as np
from surprise import Reader
from collections import defaultdict
from surprise import Dataset

### Data Import and Exploration

In [2]:
#import the specialist and rating data as dataframes
specialists = pd.read_csv("specialist.csv")
ratings = pd.read_csv("specialists_ratings.csv")

In [3]:
#Check the size of each dataframe
specialists.shape

(402, 8)

In [4]:
ratings.shape

(220, 3)

In [5]:
#Get basic details about the dataframe constituents
specialists.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 402 entries, 0 to 401
Data columns (total 8 columns):
Id             401 non-null float64
name           401 non-null object
specialty      401 non-null object
institution    401 non-null object
location       401 non-null object
days           401 non-null object
language       401 non-null object
telephone      401 non-null float64
dtypes: float64(2), object(6)
memory usage: 25.2+ KB


As shown above, the specialists dataset describes each specialist in the database: their name, specialty, institution, location, working days, language and telephone number. This can be considered their demographic data, and it will be used to build the specialist profile that is matched against the user's needs.

In [6]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 220 entries, 0 to 219
Data columns (total 3 columns):
user_id          220 non-null int64
specialist_id    220 non-null int64
ratings          220 non-null int64
dtypes: int64(3)
memory usage: 5.3 KB


In [7]:
#Get a preview of the dataframes
specialists.head()

Unnamed: 0,Id,name,specialty,institution,location,days,language,telephone
0,1.0,DR YOMBA,CARDIOLOGIST,Le Jourdain,Nlongkak,Monday|Wednesday,French,237611000000.0
1,2.0,DR BOOMBHI,CARDIOLOGIST,Le Jourdain,Nlongkak,Tuesday|Wednesday|Thursday,French,237611000000.0
2,3.0,DR GRACE OKORO,DERMATOLOGIST,Le Jourdain,Nlongkak,Monday|Tuesday|Saturday+F4:F118,French,237611000000.0
3,4.0,DR TSONGUI,DERMATOLOGIST,Le Jourdain,Nlongkak,Thursday|Friday|Saturday,French,237611000000.0
4,5.0,DR ATEBA,ENDOCRINOLOGIST,Le Jourdain,Nlongkak,Monday|Tuesday|Thursday,French,237611000000.0
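
The `days` value for row 2 in the preview ends with `+F4:F118`, which does not look like a weekday. The sketch below shows one way such a suffix could be stripped. It assumes the suffix is a leftover spreadsheet cell-range reference; that is a guess from its shape, not something the data confirms. The `sample` frame and the `days_clean` column are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy frame mimicking the `days` column; the third value carries the stray
# "+F4:F118" suffix seen in the preview above.
sample = pd.DataFrame({
    "days": [
        "Monday|Wednesday",
        "Tuesday|Wednesday|Thursday",
        "Monday|Tuesday|Saturday+F4:F118",
    ]
})

# Assumption: the suffix is a spreadsheet cell-range reference.
# The pattern removes a trailing "+<letters><digits>:<letters><digits>".
cell_range_suffix = r"\+[A-Z]+\d+:[A-Z]+\d+$"
sample["days_clean"] = sample["days"].str.replace(cell_range_suffix, "", regex=True)

print(sample["days_clean"].tolist())
```

Before applying anything like this to the full column, it would be worth checking whether other rows contain similar residue in a different shape.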


In [8]:
ratings.head()

Unnamed: 0,user_id,specialist_id,ratings
0,1,177,5
1,1,28,1
2,1,391,4
3,1,365,5
4,1,356,3


The table above displays a preview of the specialists' ratings. Ratings range from 1 to 5, with 1 implying that the user was unsatisfied with the specialist's services and 5 implying that the specialist provides stellar services.

In [9]:
#Check for the existence of null values
specialists.isnull().values.any()

True

In [10]:
ratings.isnull().values.any()

False

In [11]:
#Return the count of null values
specialists.isnull().sum()

Id             1
name           1
specialty      1
institution    1
location       1
days           1
language       1
telephone      1
dtype: int64

In [12]:
#Return the row which has the missing data
null_data = specialists[specialists.isnull().any(axis=1)]
null_data

Unnamed: 0,Id,name,specialty,institution,location,days,language,telephone
401,,,,,,,,


In [13]:
#Dropping the empty row
specialists = specialists.drop(401)
specialists.shape

(401, 8)
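
Dropping by the label `401` works here because the empty row's position is known. As a minimal sketch of an alternative that does not depend on the index label, `dropna(how="all")` removes any row in which every column is missing. The example also casts `Id` to an integer type, on the assumption that the column is `float64` only because of the missing value. The `toy` frame is a hypothetical stand-in for `specialists`.

```python
import numpy as np
import pandas as pd

# Toy frame with one fully empty trailing row, like row 401 in `specialists`.
toy = pd.DataFrame({
    "Id": [1.0, 2.0, np.nan],
    "name": ["DR YOMBA", "DR BOOMBHI", np.nan],
})

# Drop rows where every column is missing, without hard-coding the index label.
toy = toy.dropna(how="all")

# With no missing values left, Id can be stored as an integer so that it
# has the same dtype as `specialist_id` in the ratings table.
toy["Id"] = toy["Id"].astype("int64")

print(toy.dtypes)
```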

In [14]:
#Export the clean data to a csv file for future use
specialists.to_csv('specialistLens.csv')

The ratings data is explored next to see whether it contains any notable patterns.

In [15]:
ratings.describe()

Unnamed: 0,user_id,specialist_id,ratings
count,220.0,220.0,220.0
mean,12.277273,191.795455,3.059091
std,6.55712,116.307296,1.398355
min,1.0,2.0,1.0
25%,8.0,83.0,2.0
50%,12.0,179.0,3.0
75%,16.0,293.5,4.0
max,25.0,396.0,5.0


## Data Preprocessing
#### In this step, the data will be restructured to make it easier to use when creating the recommender system. The end goal is a dataframe with 3 main fields: the id, the name and the attributes, which combine the specialty, institution, location, days and language. The telephone number is not part of the attributes and is dropped.

In [16]:
#Merge the five descriptive columns into 1 column called attributes (telephone is left out)
specialists['attributes'] = specialists['specialty'] + '|' + specialists['institution'] + '|' + specialists['location'] + '|' + specialists['days'] + '|' + specialists['language']

The other columns are dropped so that the resulting dataframe has just 3 columns: the specialist's Id, their name and the attributes, which are considered to be their demographic data.

In [17]:
specialist_clean = specialists.drop(['specialty', 'institution', 'location', 'days', 'language', 'telephone'], axis=1)
specialist_clean.head()

Unnamed: 0,Id,name,attributes
0,1.0,DR YOMBA,CARDIOLOGIST|Le Jourdain|Nlongkak|Monday|Wedne...
1,2.0,DR BOOMBHI,CARDIOLOGIST|Le Jourdain|Nlongkak|Tuesday|Wedn...
2,3.0,DR GRACE OKORO,DERMATOLOGIST|Le Jourdain|Nlongkak|Monday|Tues...
3,4.0,DR TSONGUI,DERMATOLOGIST|Le Jourdain|Nlongkak|Thursday|Fr...
4,5.0,DR ATEBA,ENDOCRINOLOGIST|Le Jourdain|Nlongkak|Monday|Tu...


In [18]:
#Dataframe is exported to csv file
specialist_clean.to_csv('specialistClean.csv')

In [19]:
# A reader is still needed, but only the rating_scale param is required.
reader = Reader(rating_scale=(1, 5))

In [20]:
# The columns must correspond to user id, specialist id and ratings (in that order).
rating_data = Dataset.load_from_df(ratings[['user_id', 'specialist_id', 'ratings']], reader)
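
Before handing a ratings frame to `Dataset.load_from_df`, it may be worth confirming that every rating lies within the declared scale and that every `specialist_id` refers to a known specialist. The sketch below shows both checks using pandas only. The `toy_specialists` and `toy_ratings` frames are hypothetical and deliberately contain one bad value of each kind.

```python
import pandas as pd

# Toy stand-ins for `specialists` and `ratings`.
toy_specialists = pd.DataFrame({"Id": [1, 2, 3]})
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "specialist_id": [1, 3, 4],   # 4 has no matching specialist
    "ratings": [5, 2, 6],         # 6 falls outside the 1-5 scale
})

rating_scale = (1, 5)  # same bounds as the ones passed to Reader

# Ratings outside the declared scale (bounds are inclusive).
out_of_scale = toy_ratings[~toy_ratings["ratings"].between(*rating_scale)]

# Ratings that point at a specialist id missing from the specialists table.
unknown_specialist = toy_ratings[
    ~toy_ratings["specialist_id"].isin(toy_specialists["Id"])
]

print(out_of_scale)
print(unknown_specialist)
```

Both result frames being empty would indicate that the ratings are consistent with the scale and with the specialists table.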