# Your First Machine Learning (ML) Algorithm - Linear Regression Project
## Table of contents
### Exploratory Data Analysis (EDA)

* [0. Problem Statement](#c0)
* [1. Importing libraries](#c1)
* [2. Data Collection](#c2)
* [3. Exploration and Data Cleaning](#c3)
  * [3.1 Drop Duplicate Information](#c3-1)
  * [3.2 Drop Irrelevant Information](#c3-2)
* [4. Univariate Analysis](#c4)
  * [4.1 Dividing our dataset into categorical and numerical](#c4-1)
  * [4.2 Categorical Variable Analysis](#c4-2)
  * [4.3 Numerical Variable Analysis](#c4-3)
* [5. Multivariate Analysis](#c5)
  * [5.1 Categorical-Categorical Analysis](#c5-1)
  * [5.2 Encoding Categorical Values and Saving JSON Files](#c5-2)
  * [5.3 Numerical-Categorical Analysis](#c5-3)
* [6. Feature Engineering](#c6)
  * [6.1 New Feature Inference](#c6-1)
  * [6.2 Outliers Analysis](#c6-2)
  * [6.3 Missing Values Analysis](#c6-3)
  * [6.4 Train/Test Split of Both DataFrames](#c6-4)
* [7. Data Processing](#c7)
  * [7.1 Normalization](#c7-1)
  * [7.2 Min-Max Scaling](#c7-2)

### Machine Learning
* [8. Machine Learning](#c8)
  * [8.1 Which is the best dataset?](#c8-1)
  * [8.2 Model Optimization](#c8-2)
  * [8.3 Evaluation of the Model](#c8-3)

## 0. Problem Statement <a id='c0'></a>
The objective of this exercise is to conduct an in-depth exploratory data analysis (EDA) of a US county-level dataset of sociodemographic and health resource data. We'll then develop a predictive model to determine whether there is any relationship between health resources and sociodemographic data at the US county level.

## 1. Importing libraries <a id='c1'></a>

In [5]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os
import json
import warnings
import pickle
from pickle import dump
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

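# Silence every warning issued through warnings.warn by replacing it with a no-op
def warn(*args, **kwargs):
    pass
warnings.warn = warn
warnings.filterwarnings("ignore", category=FutureWarning)
# Show all columns when a DataFrame is displayed
pd.set_option('display.max_columns', None)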

## 2. Data Collection <a id='c2'></a>

In [6]:
URL = 'https://raw.githubusercontent.com/4GeeksAcademy/regularized-linear-regression-project-tutorial/main/demographic_health_data.csv'

def get_data(url: str) -> pd.DataFrame:
    # Download the CSV file and return its contents as a DataFrame
    return pd.read_csv(url, sep=',')

get_data(URL)

Unnamed: 0,fips,TOT_POP,0-9,0-9 y/o % of total pop,19-Oct,10-19 y/o % of total pop,20-29,20-29 y/o % of total pop,30-39,30-39 y/o % of total pop,40-49,40-49 y/o % of total pop,50-59,50-59 y/o % of total pop,60-69,60-69 y/o % of total pop,70-79,70-79 y/o % of total pop,80+,80+ y/o % of total pop,White-alone pop,% White-alone,Black-alone pop,% Black-alone,Native American/American Indian-alone pop,% NA/AI-alone,Asian-alone pop,% Asian-alone,Hawaiian/Pacific Islander-alone pop,% Hawaiian/PI-alone,Two or more races pop,% Two or more races,POP_ESTIMATE_2018,N_POP_CHG_2018,GQ_ESTIMATES_2018,R_birth_2018,R_death_2018,R_NATURAL_INC_2018,R_INTERNATIONAL_MIG_2018,R_DOMESTIC_MIG_2018,R_NET_MIG_2018,Less than a high school diploma 2014-18,High school diploma only 2014-18,Some college or associate's degree 2014-18,Bachelor's degree or higher 2014-18,Percent of adults with less than a high school diploma 2014-18,Percent of adults with a high school diploma only 2014-18,Percent of adults completing some college or associate's degree 2014-18,Percent of adults with a bachelor's degree or higher 2014-18,POVALL_2018,PCTPOVALL_2018,PCTPOV017_2018,PCTPOV517_2018,MEDHHINC_2018,CI90LBINC_2018,CI90UBINC_2018,Civilian_labor_force_2018,Employed_2018,Unemployed_2018,Unemployment_rate_2018,Median_Household_Income_2018,Med_HH_Income_Percent_of_State_Total_2018,Active Physicians per 100000 Population 2018 (AAMC),Total Active Patient Care Physicians per 100000 Population 2018 (AAMC),Active Primary Care Physicians per 100000 Population 2018 (AAMC),Active Patient Care Primary Care Physicians per 100000 Population 2018 (AAMC),Active General Surgeons per 100000 Population 2018 (AAMC),Active Patient Care General Surgeons per 100000 Population 2018 (AAMC),Total nurse practitioners (2019),Total physician assistants (2019),Total Hospitals (2019),Internal Medicine Primary Care (2019),Family Medicine/General Practice Primary Care (2019),Total Specialist Physicians (2019),ICU Beds_x,Total Population,Population Aged 60+,Percent of Population Aged 60+,COUNTY_NAME,STATE_NAME,STATE_FIPS,CNTY_FIPS,county_pop2018_18 and older,anycondition_prevalence,anycondition_Lower 95% CI,anycondition_Upper 95% CI,anycondition_number,Obesity_prevalence,Obesity_Lower 95% CI,Obesity_Upper 95% CI,Obesity_number,Heart disease_prevalence,Heart disease_Lower 95% CI,Heart disease_Upper 95% CI,Heart disease_number,COPD_prevalence,COPD_Lower 95% CI,COPD_Upper 95% CI,COPD_number,diabetes_prevalence,diabetes_Lower 95% CI,diabetes_Upper 95% CI,diabetes_number,CKD_prevalence,CKD_Lower 95% CI,CKD_Upper 95% CI,CKD_number,Urban_rural_code
0,1001,55601,6787,12.206615,7637,13.735364,6878,12.370281,7089,12.749771,7582,13.636445,7738,13.917016,5826,10.478229,4050,7.284042,2014,3.622237,42660,76.725239,10915,19.630942,267,0.480207,681,1.224798,62,0.111509,1016,1.827305,55601,158,455,11.8,9.6,2.2,0.0,0.7,0.6,4204,12119,10552,10291,11.3,32.6,28.4,27.7,7587,13.8,19.3,19.5,59338,53628,65048,25957,25015,942,3.6,59338,119.0,217.1,196.7,77.2,71.2,7.6,6.9,28.859137,6.085786,1.148905,25.992561,21.249061,72.142154,6,55036,10523,19.1,Autauga,Alabama,1,1,42438,47.6,45.4,49.4,20181,35.8,34.2,37.3,15193,7.9,7.2,8.7,3345,8.6,7.3,9.9,3644,12.9,11.9,13.8,5462,3.1,2.9,3.3,1326,3
1,1003,218022,24757,11.355276,26913,12.344167,23579,10.814964,25213,11.564429,27338,12.539102,29986,13.753658,29932,13.728890,20936,9.602701,9368,4.296814,190301,87.285228,19492,8.940382,1684,0.772399,2508,1.150343,146,0.066966,3891,1.784682,218022,5403,2190,10.5,10.3,0.1,0.5,24.3,24.8,14310,40579,46025,46075,9.7,27.6,31.3,31.3,21069,9.8,13.9,13.1,57588,54437,60739,93849,90456,3393,3.6,57588,115.5,217.1,196.7,77.2,71.2,7.6,6.9,113.162114,23.863512,4.505074,101.921730,83.321572,282.882982,51,203360,53519,26.3,Baldwin,Alabama,1,3,170912,40.2,38.2,42.3,68790,29.7,28.4,31.0,50761,7.8,7.0,8.7,13414,8.6,7.2,10.1,14692,12.0,11.0,13.1,20520,3.2,3.0,3.5,5479,4
2,1005,24881,2732,10.980266,2960,11.896628,3268,13.134520,3201,12.865239,3074,12.354809,3278,13.174712,3076,12.362847,2244,9.018930,1048,4.212049,12209,49.069571,12042,48.398376,164,0.659137,113,0.454162,46,0.184880,307,1.233873,24881,-277,2820,10.4,12.9,-2.5,0.5,-9.1,-8.6,4901,6486,4566,2220,27.0,35.7,25.1,12.2,6788,30.9,43.9,36.7,34382,31157,37607,8373,7940,433,5.2,34382,68.9,217.1,196.7,77.2,71.2,7.6,6.9,12.914231,2.723340,0.514126,11.631462,9.508784,32.283033,5,26201,6150,23.5,Barbour,Alabama,1,5,19689,57.5,55.6,59.1,11325,40.7,39.5,41.9,8013,11.0,10.1,11.8,2159,12.1,10.7,13.3,2373,19.7,18.6,20.6,3870,4.5,4.2,4.8,887,6
3,1007,22400,2456,10.964286,2596,11.589286,3029,13.522321,3113,13.897321,3038,13.562500,3115,13.906250,2545,11.361607,1723,7.691964,785,3.504464,17211,76.834821,4770,21.294643,98,0.437500,53,0.236607,26,0.116071,242,1.080357,22400,-155,2151,11.1,11.4,-0.3,0.4,-7.0,-6.6,2650,7471,3846,1813,16.8,47.3,24.4,11.5,4400,21.8,27.8,26.3,46064,41283,50845,8661,8317,344,4.0,46064,92.3,217.1,196.7,77.2,71.2,7.6,6.9,11.626493,2.451783,0.462860,10.471635,8.560619,29.063942,0,22580,4773,21.1,Bibb,Alabama,1,7,17813,51.6,49.6,53.4,9190,38.7,37.4,40.2,6894,8.6,7.9,9.3,1533,10.0,8.8,11.3,1789,14.1,13.2,14.9,2511,3.3,3.1,3.6,595,2
4,1009,57840,7095,12.266598,7570,13.087828,6742,11.656293,6884,11.901798,7474,12.921853,7844,13.561549,6965,12.041840,4931,8.525242,2335,4.036999,55456,95.878285,950,1.642462,378,0.653527,185,0.319848,70,0.121024,801,1.384855,57840,13,489,11.8,11.4,0.3,0.1,-0.2,0.0,7861,13489,13267,5010,19.8,34.0,33.5,12.6,7527,13.2,18.0,15.5,50412,46157,54667,25006,24128,878,3.5,50412,101.1,217.1,196.7,77.2,71.2,7.6,6.9,30.021267,6.330854,1.195171,27.039257,22.104740,75.047251,6,57667,13600,23.6,Blount,Alabama,1,9,44448,46.3,44.3,48.4,20584,34.0,32.4,35.5,15112,9.2,8.4,10.1,4101,10.5,9.1,12.0,4661,13.5,12.6,14.5,6017,3.4,3.2,3.7,1507,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3135,56037,43051,6104,14.178532,6326,14.694200,5359,12.448027,6577,15.277229,5334,12.389956,5261,12.220390,4943,11.481731,2198,5.105572,949,2.204362,40397,93.835219,592,1.375113,665,1.544680,435,1.010429,68,0.157952,894,2.076607,43051,-496,660,12.6,7.0,5.6,1.4,-18.5,-17.2,2549,9433,10108,6243,9.0,33.3,35.7,22.0,3540,8.4,10.0,8.5,73315,67141,79489,21150,20254,896,4.2,73315,117.8,207.9,199.6,79.8,76.2,10.6,10.4,21.311749,11.997173,2.160982,9.836192,28.018244,40.537033,10,44527,7015,15.8,Sweetwater,Wyoming,56,37,31761,40.3,38.5,42.2,12796,33.6,32.1,35.1,10672,5.9,5.4,6.4,1862,6.6,5.7,7.7,2098,8.9,8.3,9.6,2834,2.6,2.4,2.8,821,5
3136,56039,23081,2384,10.328842,2185,9.466661,2967,12.854729,4093,17.733200,3423,14.830380,2928,12.685759,2875,12.456133,1569,6.797799,657,2.846497,21971,95.190850,175,0.758199,211,0.914172,326,1.412417,35,0.151640,363,1.572722,23081,-180,270,9.9,3.2,6.7,1.6,-16.2,-14.5,958,2272,4087,9847,5.6,13.2,23.8,57.4,1443,6.3,6.7,5.9,99087,88142,110032,15340,14876,464,3.0,99087,159.2,207.9,199.6,79.8,76.2,10.6,10.4,11.425901,6.432063,1.158570,5.273493,15.021465,21.733183,6,22923,4298,18.7,Teton,Wyoming,56,39,18842,24.3,22.4,26.1,4576,19.4,18.1,20.8,3655,5.2,4.6,5.8,981,4.9,4.0,6.0,928,7.2,6.5,8.0,1360,2.4,2.2,2.6,447,5
3137,56041,20299,3121,15.375142,3205,15.788955,2153,10.606434,2702,13.311001,2390,11.773979,2434,11.990738,2554,12.581901,1233,6.074191,507,2.497660,19356,95.354451,168,0.827627,288,1.418789,96,0.472930,30,0.147791,361,1.778413,20299,-157,220,13.5,7.2,6.3,0.0,-14.0,-14.1,934,5176,4810,1995,7.2,40.1,37.2,15.4,2010,10.0,12.1,10.6,63401,56812,69990,8931,8541,390,4.4,63401,101.8,207.9,199.6,79.8,76.2,10.6,10.4,10.048714,5.656794,1.018926,4.637868,13.210897,19.113638,6,20758,3554,17.1,Uinta,Wyoming,56,41,14455,42.2,40.0,44.2,6101,34.2,32.5,35.7,4944,7.2,6.5,7.9,1034,8.0,7.0,9.2,1163,10.4,9.5,11.2,1500,3.0,2.8,3.2,430,5
3138,56043,7885,858,10.881420,1113,14.115409,715,9.067850,903,11.452124,900,11.414077,1085,13.760304,1118,14.178821,741,9.397590,452,5.732403,7450,94.483196,40,0.507292,143,1.813570,65,0.824350,8,0.101458,179,2.270133,7885,-150,140,9.3,10.7,-1.4,-0.6,-16.6,-17.2,590,1717,2135,1220,10.4,30.3,37.7,21.5,918,11.9,16.1,14.5,55190,49263,61117,4058,3886,172,4.2,55190,88.6,207.9,199.6,79.8,76.2,10.6,10.4,3.903350,2.197341,0.395794,1.801546,5.131678,7.424555,0,8253,2206,26.7,Washakie,Wyoming,56,43,6092,37.5,35.3,39.5,2285,28.4,27.0,29.8,1730,8.2,7.4,9.0,500,8.3,7.1,9.6,506,11.3,10.3,12.1,686,3.4,3.2,3.7,207,6


## 3. Exploration and Data Cleaning <a id='c3'></a>

In [7]:
total_data = get_data(URL)
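
Before cleaning, it can help to confirm the size of the table and which columns are numeric. The following is a minimal sketch on a hypothetical three-row sample (the values are made up and are not taken from the dataset); the same calls should apply to `total_data`.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the real CSV; all values are invented.
sample_csv = io.StringIO(
    "fips,TOT_POP,COUNTY_NAME,STATE_NAME\n"
    "1,100,Alpha,StateA\n"
    "2,250,Beta,StateA\n"
    "3,75,Gamma,StateB\n"
)
sample = pd.read_csv(sample_csv)

# Number of rows and columns
n_rows, n_cols = sample.shape

# Split the column names by dtype: numeric versus everything else
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(f"{n_rows} rows x {n_cols} columns")
print("numeric:", numeric_cols)
print("non-numeric:", text_cols)
```

On the real data, the non-numeric list would be expected to contain columns such as `COUNTY_NAME` and `STATE_NAME`, which are natural candidates to review in the next steps.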

#### 3.1 Drop Duplicate Information <a id='c3-1'></a>

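In this case, we don't delete any rows because there is no duplicated data. The cell below also lists the columns that contain missing values; the empty result shows there are none.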

In [11]:
nan_columns = total_data.isna().sum()
print(nan_columns[nan_columns > 0])

Series([], dtype: int64)
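
The cell above counts missing values rather than duplicated rows. As a minimal sketch of how a duplicate check might look, the example below uses a hypothetical toy frame (invented values, with one row repeated on purpose); on the real data the same calls would be made on `total_data`.

```python
import pandas as pd

# Hypothetical toy frame; the second and third rows are identical on purpose.
toy = pd.DataFrame({"fips": [1, 2, 2], "TOT_POP": [100, 250, 250]})

# duplicated() flags every row that repeats an earlier row
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print("duplicated rows:", n_duplicates)
print("rows after dropping duplicates:", len(deduplicated))
```

A count of zero on the real dataset would be consistent with the statement that no rows need to be removed.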


#### 3.2 Drop Irrelevant Information <a id='c3-2'></a>