# Supervised learning: classification

**Goal:** Predict whether a city would be considered a medical desert.

To do so, we use multiclass classification with three classes: 0 = not a medical desert, 1 = high risk of being a medical desert, and 2 = medical desert. The labels are built from the APL indicator. DREES sets the threshold at 2.5 (a city with an APL under 2.5 is considered a medical desert). To avoid a hard cut-off, we put a band around that threshold: an APL of 3 or more gives class 0, an APL from 2 up to (but not including) 3 gives class 1, and an APL under 2 gives class 2.

**Plan:**
- Preprocess the data for classification
- Build a function that runs all candidate models and returns the one with the best accuracy and precision scores
- Iterate on the best model to improve its performance

In [75]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt

%matplotlib inline
pd.set_option('display.max_columns', 30)  # full option name; the short 'max_columns' is ambiguous in recent pandas

In [79]:
df = pd.read_csv('../data/medical_desert_clean.csv', dtype={'CODGEO':'str','Communes':'str'})
print('Shape:',df.shape)
df.head()

Shape: (34989, 26)


Unnamed: 0,CODGEO,Communes,APL,P16_POP,median_living_standard,healthcare_education_establishments,density_area,annual_pop_growth,unemployment_rate,secondary_residence_rate,vacant_residence_rate,active_local_business_rate,city_social_amenities_rate,0_14_pop_rate,15_59_pop_rate,60+_pop_rate,mobility_rate,average_birth_rate,CSP1_rate,CSP2_rate,CSP3_rate,CSP4_rate,CSP5_rate,CSP6_rate,CSP7_rate,CSP8_rate
0,1001,L'Abergement-Clémenciat,2.396,767,22679.0,0,48.087774,-0.335578,7.12743,4.597701,7.471264,48.0,14.0,20.990874,55.149935,23.859192,2.216428,1.060116,2.479339,3.305785,12.396694,15.702479,16.528926,20.661157,23.966942,4.958678
1,1002,L'Abergement-de-Varey,2.721,243,24382.083333,0,26.557377,0.757662,6.944444,30.769231,9.467456,57.894737,15.789474,22.633745,55.555556,21.8107,2.057613,1.761006,0.0,10.25641,7.692308,12.820513,20.512821,5.128205,33.333333,10.25641
2,1004,Ambérieu-en-Bugey,4.335,14081,19721.0,0,572.398374,0.347315,12.038385,1.684887,9.223702,67.838444,17.950636,19.82313,57.904337,22.272533,1.516341,1.595989,0.024879,2.662394,6.93941,17.209926,16.240671,15.94093,24.740051,16.241738
3,1005,Ambérieux-en-Dombes,4.279,1671,23378.0,0,104.962312,0.872154,6.34866,1.810755,4.979578,55.319149,10.638298,20.521782,58.339888,21.13833,0.985957,1.235096,0.378011,4.511481,7.896554,17.27101,18.019503,17.254154,23.429304,11.239984
4,1006,Ambléon,0.912,110,21660.0,0,18.707483,-0.359722,11.111111,16.216216,12.162162,71.428571,28.571429,10.909091,54.545455,34.545455,2.727273,1.621622,0.0,0.0,5.555556,27.777778,16.666667,16.666667,27.777778,5.555556


________________________
## Preprocessing 

Transform the dataframe so that it is ready for classification modelling.

**Preprocessing tasks:**
- [x] Drop correlated and useless columns (Communes, P16_POP, 60+_pop_rate)
- [x] Create the three-class target column (the dependent variable)
- [x] Change the scale of the median_living_standard and density_area features
- [x] Check the balance of the dataset and correct it if needed

In [80]:
from sklearn.preprocessing import maxabs_scale

In [81]:
# Dropping correlated and useless columns

df.drop(columns=['Communes','P16_POP','60+_pop_rate'], inplace = True)
print("Shape after manipulation:", df.shape)

Shape after manipulation: (34989, 23)


In [82]:
# Creating the three-class target column from APL:
# 0 if APL >= 3, 1 if 2 <= APL < 3, 2 if APL < 2

df['medical_desert'] = df.APL.apply(lambda x: 0 if x>=3 else 1 if x>=2 else 2)
df.drop('APL',axis=1,inplace=True)
print("Shape after manipulation:",df.shape)

Shape after manipulation: (34989, 23)
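
The labelling rule in the cell above can also be written in vectorized form. The sketch below is only an illustration: `label_medical_desert` is a hypothetical helper name, not part of the notebook, and the input reuses the five APL values shown by `df.head()` earlier.

```python
import numpy as np
import pandas as pd


def label_medical_desert(apl: pd.Series) -> pd.Series:
    """Map APL values to classes: 0 (APL >= 3), 1 (2 <= APL < 3), 2 (APL < 2)."""
    # np.select takes the first condition that holds, so the order matters.
    labels = np.select([apl >= 3, apl >= 2], [0, 1], default=2)
    return pd.Series(labels, index=apl.index, name="medical_desert")


# APL values of the five rows displayed by df.head() earlier in the notebook
apl_demo = pd.Series([2.396, 2.721, 4.335, 4.279, 0.912])
labels_demo = label_medical_desert(apl_demo)
print(labels_demo.tolist())  # [1, 1, 0, 0, 2]
```

Like the lambda, this sends a missing APL to class 2, because both comparisons are false for NaN. Whether the dataset contains missing APL values is not shown here, so that case may deserve a separate check.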


In [88]:
# Scaling each feature according to its type to bring the data to roughly [0,1]
# (annual_pop_growth can be negative, so it is not strictly within that range)
# healthcare_education_establishments is the only column left unchanged

df.median_living_standard = df.median_living_standard/100000 
df.density_area = maxabs_scale(df.density_area) # Using maxabs_scale because it does not break the sparsity
df.iloc[:,4:-1] = df.iloc[:,4:-1]/100 # Percentage columns (annual_pop_growth to CSP8) converted to fractions

df.head()

Unnamed: 0,CODGEO,median_living_standard,healthcare_education_establishments,density_area,annual_pop_growth,unemployment_rate,secondary_residence_rate,vacant_residence_rate,active_local_business_rate,city_social_amenities_rate,0_14_pop_rate,15_59_pop_rate,mobility_rate,average_birth_rate,CSP1_rate,CSP2_rate,CSP3_rate,CSP4_rate,CSP5_rate,CSP6_rate,CSP7_rate,CSP8_rate,medical_desert
0,1001,0.22679,0,0.0012,-0.003356,0.071274,0.045977,0.074713,0.48,0.14,0.209909,0.551499,0.022164,0.010601,0.024793,0.033058,0.123967,0.157025,0.165289,0.206612,0.239669,0.049587,1
1,1002,0.243821,0,0.000663,0.007577,0.069444,0.307692,0.094675,0.578947,0.157895,0.226337,0.555556,0.020576,0.01761,0.0,0.102564,0.076923,0.128205,0.205128,0.051282,0.333333,0.102564,1
2,1004,0.19721,0,0.014289,0.003473,0.120384,0.016849,0.092237,0.678384,0.179506,0.198231,0.579043,0.015163,0.01596,0.000249,0.026624,0.069394,0.172099,0.162407,0.159409,0.247401,0.162417,0
3,1005,0.23378,0,0.00262,0.008722,0.063487,0.018108,0.049796,0.553191,0.106383,0.205218,0.583399,0.00986,0.012351,0.00378,0.045115,0.078966,0.17271,0.180195,0.172542,0.234293,0.1124,0
4,1006,0.2166,0,0.000467,-0.003597,0.111111,0.162162,0.121622,0.714286,0.285714,0.109091,0.545455,0.027273,0.016216,0.0,0.0,0.055556,0.277778,0.166667,0.166667,0.277778,0.055556,2
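
To make the two transformations concrete, here is a minimal sketch on a toy frame. It assumes that max-abs scaling amounts to dividing a column by its largest absolute value, and it reuses a few raw values from the first rows displayed earlier. The toy maximum differs from the full dataset's, so the scaled densities will not match the output above. The last step lists the columns that still fall outside [0,1].

```python
import pandas as pd

# Toy frame with raw values taken from the first three rows shown earlier (illustration only)
demo = pd.DataFrame({
    "density_area": [48.087774, 26.557377, 572.398374],
    "annual_pop_growth": [-0.335578, 0.757662, 0.347315],
    "unemployment_rate": [7.12743, 6.944444, 12.038385],
})

# Max-abs scaling: divide by the largest absolute value, so zeros stay zeros
demo["density_area"] = demo["density_area"] / demo["density_area"].abs().max()

# Percentage columns are converted to fractions
pct_cols = ["annual_pop_growth", "unemployment_rate"]
demo[pct_cols] = demo[pct_cols] / 100

# Per-column min and max, then keep the columns that leave the [0, 1] range
summary = demo.agg(["min", "max"]).T
outside = summary[(summary["min"] < 0) | (summary["max"] > 1)]
print(outside)
```

On this toy data only `annual_pop_growth` is reported, because a negative growth rate stays negative after division by 100.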


In [89]:
# Setting CODGEO as the index to avoid losing the information, since we won't use it as a feature

df = df.set_index('CODGEO')

In [90]:
# Checking the balance of the dataset

df.medical_desert.value_counts()

0    18751
1    10719
2     5519
Name: medical_desert, dtype: int64

The dataset is imbalanced, so we should balance it to avoid bias in the model.

Since the minority class has more than 5,000 observations, random under-sampling alone should be enough: we randomly pick that same number of observations from each of the other classes, which still leaves enough data for modelling.

In [91]:
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

In [92]:
X = df.drop(columns=['medical_desert'])
y = df.medical_desert

In [93]:
# Under-sampling every class except the minority one
print('Original dataset shape %s' % Counter(y))
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))

Original dataset shape Counter({0: 18751, 1: 10719, 2: 5519})
Resampled dataset shape Counter({0: 5519, 1: 5519, 2: 5519})
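
For intuition, random under-sampling can be sketched by hand in pandas: draw, without replacement, as many rows from each class as the smallest class contains. The toy counts and the `X_demo`, `y_demo` names below are made up for illustration, and the rows selected will not match those picked by `RandomUnderSampler`.

```python
import pandas as pd

# Toy target with an imbalance similar in spirit to the real one (made-up counts)
y_demo = pd.Series([0] * 50 + [1] * 30 + [2] * 10, name="medical_desert")
X_demo = pd.DataFrame({"feature": range(len(y_demo))})

# Size of the smallest class
n_min = y_demo.value_counts().min()

# Draw n_min rows without replacement from each class
kept_idx = y_demo.groupby(y_demo).sample(n=n_min, random_state=42).index

# Select the same rows in the features and the target
X_bal = X_demo.loc[kept_idx]
y_bal = y_demo.loc[kept_idx]
print(y_bal.value_counts().sort_index().to_dict())
```

Every class ends up with `n_min` rows, which is the same outcome as the resampled shape printed above, on a smaller scale.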


### Conclusion on preprocessing 

- All features are on a comparable scale (mostly within [0,1]; annual_pop_growth can be negative), except for the column healthcare_education_establishments, which was left unchanged
- All the classes have the same number of observations (5,519 each)

**Possible improvements:**
- Use feature engineering methods to reduce the number of columns
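
One simple way to reduce the number of columns is a correlation filter. This is a sketch only, not something the notebook does: `drop_highly_correlated` is a hypothetical helper, and the 0.9 threshold is an arbitrary assumption that would need tuning.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(X: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop the later column of every pair whose absolute correlation exceeds threshold."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)


# Toy data: a_copy is a linear function of a, b is unrelated
a = np.arange(10.0)
demo = pd.DataFrame({
    "a": a,
    "a_copy": 2 * a + 1,
    "b": [1.0, -1.0] * 5,
})
reduced = drop_highly_correlated(demo)
print(list(reduced.columns))  # ['a', 'b']
```

Applied to `X_res`, this would remove one feature from each strongly correlated pair. Which of the two features to keep is a modelling decision this sketch does not address.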