# Supervised learning: classification

**Goal:** Predict whether a city would be considered a medical desert.

To do this, we frame the problem as a binary classification with 0 = not a medical desert and 1 = medical desert. The label is derived from the DREES threshold of 2.5 on the APL: a city with an APL under 2.5 is considered a medical desert.

**Plan:**
- Preprocess the data so that it is suitable for classification
- Build a function that runs all candidate models and identifies the one with the best accuracy and precision scores
- Iterate on the best model to improve its performance
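
The second step of the plan could look roughly like the sketch below. It is only an illustration: the function name `compare_models`, the two candidate models, and the synthetic data from `make_classification` are assumptions, not the notebook's actual choices. Each model is cross-validated and its mean accuracy and precision are collected in one table.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier


def compare_models(models, X, y, cv=5):
    """Cross-validate each model and return its mean accuracy and precision."""
    rows = []
    for name, model in models.items():
        scores = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "precision"])
        rows.append({
            "model": name,
            "accuracy": scores["test_accuracy"].mean(),
            "precision": scores["test_precision"].mean(),
        })
    # Best accuracy first
    return pd.DataFrame(rows).sort_values("accuracy", ascending=False).reset_index(drop=True)


# Synthetic stand-in data (assumption): the real features and target come from df
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.7, 0.3], random_state=0
)

# Hypothetical shortlist of models
candidate_models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

results = compare_models(candidate_models, X_demo, y_demo)
print(results)
```

Returning a dataframe rather than a single "best" model keeps both metrics visible, which helps when the most accurate model is not the most precise one.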

In [1]:
import pandas as pd
import numpy as np 

pd.set_option('display.max_columns',30) # Full option name; the short form 'max_columns' is ambiguous in recent pandas versions

In [2]:
df = pd.read_csv('../data/medical_desert_clean.csv', dtype={'CODGEO':'str','Communes':'str'})
print('Shape:',df.shape)
df.head()

Shape: (34989, 26)


Unnamed: 0,CODGEO,Communes,APL,P16_POP,median_living_standard,healthcare_education_establishments,density_area,annual_pop_growth,unemployment_rate,secondary_residence_rate,vacant_residence_rate,active_local_business_rate,city_social_amenities_rate,0_14_pop_rate,15_59_pop_rate,60+_pop_rate,mobility_rate,average_birth_rate,CSP1_rate,CSP2_rate,CSP3_rate,CSP4_rate,CSP5_rate,CSP6_rate,CSP7_rate,CSP8_rate
0,1001,L'Abergement-Clémenciat,2.396,767,22679.0,0,48.087774,-0.335578,7.12743,4.597701,7.471264,48.0,14.0,20.990874,55.149935,23.859192,2.216428,1.060116,2.479339,3.305785,12.396694,15.702479,16.528926,20.661157,23.966942,4.958678
1,1002,L'Abergement-de-Varey,2.721,243,24382.083333,0,26.557377,0.757662,6.944444,30.769231,9.467456,57.894737,15.789474,22.633745,55.555556,21.8107,2.057613,1.761006,0.0,10.25641,7.692308,12.820513,20.512821,5.128205,33.333333,10.25641
2,1004,Ambérieu-en-Bugey,4.335,14081,19721.0,0,572.398374,0.347315,12.038385,1.684887,9.223702,67.838444,17.950636,19.82313,57.904337,22.272533,1.516341,1.595989,0.024879,2.662394,6.93941,17.209926,16.240671,15.94093,24.740051,16.241738
3,1005,Ambérieux-en-Dombes,4.279,1671,23378.0,0,104.962312,0.872154,6.34866,1.810755,4.979578,55.319149,10.638298,20.521782,58.339888,21.13833,0.985957,1.235096,0.378011,4.511481,7.896554,17.27101,18.019503,17.254154,23.429304,11.239984
4,1006,Ambléon,0.912,110,21660.0,0,18.707483,-0.359722,11.111111,16.216216,12.162162,71.428571,28.571429,10.909091,54.545455,34.545455,2.727273,1.621622,0.0,0.0,5.555556,27.777778,16.666667,16.666667,27.777778,5.555556


________________________
## Preprocessing 

Transform the dataframe so that it is ready for classification modelling.

**Preprocessing tasks:**
- [x] Drop correlated and useless columns (Communes, P16_POP, 60+_pop_rate)
- [x] Create a binary column to serve as the dependent variable
- [x] Rescale the median_living_standard and density_area features
- [ ] Check the balance of the dataset and correct it if needed
- [ ] Use feature engineering methods to reduce the number of columns
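
The notebook does not show how the correlated columns were identified. One common way to spot them, sketched below on made-up data, is to list the pairs of numeric columns whose absolute Pearson correlation exceeds a cutoff. The helper name `highly_correlated_pairs`, the 0.8 cutoff, and the demo columns `a`, `b`, `c` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(frame, threshold=0.8):
    """Return the numeric column pairs whose |Pearson r| exceeds threshold."""
    corr = frame.select_dtypes("number").corr().abs()
    # Keep only the upper triangle so that each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Made-up demo data: b is almost a linear function of a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

pairs = highly_correlated_pairs(demo)
print(pairs)
```

For each reported pair, one of the two columns is a candidate for removal, since it carries little information the other does not.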

In [13]:
from sklearn.preprocessing import maxabs_scale

In [4]:
# Dropping correlated and useless columns

df.drop(columns=['Communes','P16_POP','60+_pop_rate'], inplace = True)
print(df.shape)

(34989, 23)


In [8]:
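# Creating binary column: 1 = medical desert, 0 = not a medical desert
# Note: with this comparison, an APL of exactly 2.5 is labelled 1

df['medical_desert'] = df.APL.apply(lambda x: 0 if x>2.5 else 1)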
df.drop('APL',axis=1,inplace=True)
print(df.shape)

(34989, 23)
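
To make the labelling rule concrete, here is a small vectorized sketch of the same comparison on a few hand-picked APL values. The constant name `APL_THRESHOLD` and the sample values are assumptions chosen to show the behaviour around the threshold.

```python
import numpy as np
import pandas as pd

APL_THRESHOLD = 2.5  # DREES threshold used in this notebook

# Hand-picked APL values, including the boundary case 2.5
apl = pd.Series([0.912, 2.396, 2.5, 2.721, 4.335], name="APL")

# Same rule as the lambda above: 0 if APL > 2.5, otherwise 1
labels = pd.Series(
    np.where(apl > APL_THRESHOLD, 0, 1), index=apl.index, name="medical_desert"
)
print(labels.tolist())  # [1, 1, 1, 0, 0]
```

A value sitting exactly on the threshold is labelled as a medical desert, because the comparison `> 2.5` is strict.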


In [18]:
# Performing scaling transformations depending on the type of data

df.median_living_standard = df.median_living_standard/1000 # Expressing the values in thousands

df.density_area = maxabs_scale(df.density_area) # Using maxabs_scale because it preserves sparsity (zeros stay zero)
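
As a quick illustration of what `maxabs_scale` does, the sketch below applies it to a few made-up density values. Each value is divided by the largest absolute value, so the result lies in [-1, 1] and zeros are left untouched.

```python
import numpy as np
from sklearn.preprocessing import maxabs_scale

# Made-up density values (assumption), including a zero
density = np.array([0.0, 18.7, 48.1, 572.4])

scaled = maxabs_scale(density)

# Equivalent manual computation: divide by the maximum absolute value
manual = density / np.abs(density).max()

print(scaled)
print(np.allclose(scaled, manual))  # True
```

When the data is later split into training and test sets, fitting a `MaxAbsScaler` on the training split only is a way to keep test-set statistics out of the preprocessing.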

In [19]:
df.medical_desert.value_counts()

0    24870
1    10119
Name: medical_desert, dtype: int64
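
These counts mean that about 29% of cities are labelled as medical deserts, so the dataset is moderately imbalanced. The sketch below rebuilds a label vector from the counts above, computes the class shares, and derives "balanced" class weights with scikit-learn. Class weighting is only one possible correction, shown here as an assumption; the notebook has not yet chosen one.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Label vector rebuilt from the value counts above
y = np.repeat([0, 1], [24870, 10119])

# Share of each class in the dataset
share = np.bincount(y) / y.size
print(share.round(3))  # [0.711 0.289]

# "balanced" weights: n_samples / (n_classes * count of each class)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights.round(3))))
```

Many scikit-learn classifiers accept `class_weight="balanced"` directly, which applies the same weighting without computing it by hand.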