**Imports:**

In [116]:
import pandas as pd
from sklearn.linear_model import LogisticRegression

**Code:**

Functions:

In [117]:
def stampaPercentDF(df):
    # df is expected to be a groupby object: iterating it yields (name, group) pairs.
    for name, group in df:
        # value_counts() counts how often each unique value appears;
        # with normalize=True it returns relative frequencies instead of raw counts.
        group = round(group['Severity'].value_counts(normalize=True) * 100, 2)
        print(name)
        print(group)
        print('-'*10)
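As a usage sketch, the same per-group percentage logic can be run on a small grouped frame. The toy data and the helper name `severity_percentages` are hypothetical; only the `Severity` column name and the `value_counts(normalize=True)` call come from the notebook. This variant returns the percentages instead of only printing them, which makes them easier to inspect.

```python
import pandas as pd

# Hypothetical toy data: column names mirror the notebook, values are made up.
toy = pd.DataFrame({
    "Cause": ["Falls", "Falls", "Falls", "Work"],
    "Severity": ["Fatal", "Serious", "Serious", "Fatal"],
})

def severity_percentages(grouped):
    """Return {group name: Series of Severity percentages} for a groupby object."""
    return {
        name: (group["Severity"].value_counts(normalize=True) * 100).round(2)
        for name, group in grouped
    }

percentages = severity_percentages(toy.groupby("Cause"))
for name, pct in percentages.items():
    print(name)
    print(pct)
    print("-" * 10)
```

Grouping by a single column name (not a one-element list) keeps each `name` a plain string.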

Dataframes

In [118]:
df_train = pd.read_csv('medical-train.csv')
df_test = pd.read_csv('medical-test.csv')

Drop the rows left as NaN by the earlier data removal (rows whose first column is NaN)

In [119]:
df_train = df_train[~(pd.isna(df_train.iloc[:,0]))]
df_test = df_test[~(pd.isna(df_test.iloc[:,0]))]
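The boolean mask above can also be written with `dropna(subset=...)`, which names the column being checked. A minimal sketch on made-up data follows; the frame `raw` and its values are hypothetical, and treating `Series_reference` as the first column is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which one row has a blank first column.
raw = pd.DataFrame({
    "Series_reference": ["A1", np.nan, "A3"],
    "Period": [2000, 2001, 2002],
})

first_col = raw.columns[0]
by_mask = raw[~pd.isna(raw.iloc[:, 0])]      # the notebook's approach
by_dropna = raw.dropna(subset=[first_col])   # same rows, column named explicitly

print(by_dropna)
```

Both forms keep the original index labels, so later index-based drops behave the same way.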

Drop the columns / rows that are not useful for the analysis

In [120]:
df_train = df_train.drop(df_train[df_train['Type'] == 'Moving average'].index, axis=0)
df_train = df_train.drop(['Series_reference', 'Validation', 'Indicator', 'Data_value', 'Lower_CI', 'Upper_CI', 'Type'], axis=1)

In [121]:
df_test = df_test.drop(df_test[df_test['Type'] == 'Moving average'].index, axis=0)
df_test = df_test.drop(['Series_reference', 'Validation', 'Indicator', 'Data_value', 'Lower_CI', 'Upper_CI', 'Type'], axis=1)
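Since the train and test cells repeat the same two steps, they could be folded into one helper. This is only a sketch: the names `clean` and `DROPPED_COLUMNS` are hypothetical, the toy values (such as `'Single year'`) are invented, and the boolean filter matches the notebook's index-based drop only when the index labels are unique.

```python
import pandas as pd

DROPPED_COLUMNS = ['Series_reference', 'Validation', 'Indicator',
                   'Data_value', 'Lower_CI', 'Upper_CI', 'Type']

def clean(df):
    """Drop 'Moving average' rows, then the columns not used in the analysis."""
    kept_rows = df[df['Type'] != 'Moving average']
    return kept_rows.drop(columns=DROPPED_COLUMNS)

# Hypothetical two-row frame with the columns named in the notebook.
toy = pd.DataFrame({
    'Series_reference': ['a', 'b'],
    'Period': [2000, 2001],
    'Type': ['Single year', 'Moving average'],
    'Data_value': [1.0, 2.0],
    'Lower_CI': [0.5, 1.5],
    'Upper_CI': [1.5, 2.5],
    'Units': ['Injuries', 'Injuries'],
    'Indicator': ['Number', 'Number'],
    'Validation': ['Validated', 'Validated'],
    'Severity': ['Fatal', 'Fatal'],
})

cleaned = clean(toy)
print(cleaned)
```

Applying one function to both frames keeps the train and test preprocessing from drifting apart.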

In [122]:
print(df_train.columns.values)
print(df_test.columns.values)

['Period' 'Units' 'Cause' 'Population' 'Age' 'Severity']
['Period' 'Units' 'Cause' 'Population' 'Age' 'Severity']


Group the data (encode each category as an integer)

Training data

In [123]:
units_map = {
    'Injuries': 1,
    'Per 100,000 FTEs': 2,
    'Per 100,000 people': 3,
    'Per billion km': 4,
    'Per thousand registered vehicles': 5
}
df_train['Units'] = df_train['Units'].map(units_map)

# ! We dropped Indicator because its percentages were very similar to those of the Units column, and a closer look showed that the same units mostly had the same indicators

pop_map = {
    'Maori': 1,
    'Whole pop': 2
}
df_train['Population'] = df_train['Population'].map(pop_map)

cause_map = {
    'All': 1,
    'Assault': 2,
    'Drowing': 3,
    'Falls': 4,
    'Intentional self-harm': 5,
    'Motor vehicle traffic crashes': 6,
    'Work': 7
}
df_train['Cause'] = df_train['Cause'].map(cause_map)

age_map = {
    '0-74 years': 1,
    '75+ years': 2,
    'All ages': 3
}
df_train['Age'] = df_train['Age'].map(age_map)

severity_map = {
    'Fatal': 1,
    'Serious non-fatal': 2,
    'Serious': 3
}
df_train['Severity'] = df_train['Severity'].map(severity_map)

print(df_train)

     Period  Units  Cause  Population  Age  Severity
450    2000      1      1           2    3         1
451    2001      1      1           2    3         1
452    2002      1      1           2    3         1
453    2003      1      1           2    3         1
454    2004      1      1           2    3         1
...     ...    ...    ...         ...  ...       ...
1942   2012      1      4           1    1         3
1943   2013      1      4           1    1         3
1944   2014      1      4           1    1         3
1945   2015      1      4           1    1         3
1946   2016      1      4           1    1         3

[1497 rows x 6 columns]
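The per-column `map` calls can be driven from a single dictionary of maps, so the same encoding can be reused for any frame. The sketch below is one possible arrangement, not the notebook's code: `MAPS` and `encode` are hypothetical names, only three of the notebook's maps are repeated to keep it short, and the toy rows are made up.

```python
import pandas as pd

# Subset of the notebook's maps, repeated here so the sketch runs on its own.
MAPS = {
    'Population': {'Maori': 1, 'Whole pop': 2},
    'Age': {'0-74 years': 1, '75+ years': 2, 'All ages': 3},
    'Severity': {'Fatal': 1, 'Serious non-fatal': 2, 'Serious': 3},
}

def encode(df, maps=MAPS):
    """Return a copy of df with each mapped column replaced by its integer codes."""
    out = df.copy()
    for column, mapping in maps.items():
        out[column] = out[column].map(mapping)
    return out

# Hypothetical rows using labels that appear in the maps.
toy = pd.DataFrame({
    'Population': ['Maori', 'Whole pop'],
    'Age': ['All ages', '75+ years'],
    'Severity': ['Fatal', 'Serious'],
})

encoded = encode(toy)
print(encoded)
```

Working on a copy leaves the original labels available for later inspection.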


Test data

In [124]:
print('Units')
units_map = {
    'Injuries': 1,
    'Per 100,000 FTEs': 2,
    'Per 100,000 people': 3,
    'Per billion km': 4,
    'Per thousand registered vehicles': 5
}
df_test['Units'] = df_test['Units'].map(units_map)

# ! We dropped Indicator because its percentages were very similar to those of the Units column, and a closer look showed that the same units mostly had the same indicators

pop_map = {
    'Maori': 1,
    'Whole pop': 2
}
df_test['Population'] = df_test['Population'].map(pop_map)

cause_map = {
    'All': 1,
    'Assault': 2,
    'Drowing': 3,
    'Falls': 4,
    'Intentional self-harm': 5,
    'Motor vehicle traffic crashes': 6,
    'Work': 7
}
df_test['Cause'] = df_test['Cause'].map(cause_map)

age_map = {
    '0-74 years': 1,
    '75+ years': 2,
    'All ages': 3
}
df_test['Age'] = df_test['Age'].map(age_map)

severity_map = {
    'Fatal': 1,
    'Serious non-fatal': 2,
    'Serious': 3
}
df_test['Severity'] = df_test['Severity'].map(severity_map)

print(df_test.head(10))

Units
   Period  Units  Cause  Population  Age  Severity
0    2017      1    4.0         1.0  1.0         3
1    2018      1    4.0         1.0  1.0         3
2    2000      3    4.0         1.0  1.0         3
3    2001      3    4.0         1.0  1.0         3
4    2002      3    4.0         1.0  1.0         3
5    2003      3    4.0         1.0  1.0         3
6    2004      3    4.0         1.0  1.0         3
7    2005      3    4.0         1.0  1.0         3
8    2006      3    4.0         1.0  1.0         3
9    2007      3    4.0         1.0  1.0         3
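The float values (`4.0`, `1.0`) in `Cause`, `Population`, and `Age` hint that these test columns contain NaN after mapping: `Series.map` with a dictionary yields NaN for any value that has no key, and an integer column holding NaN is upcast to float. A minimal way to list the unmapped labels is sketched below; the series `ages` and the label `'0-14 years'` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

age_map = {'0-74 years': 1, '75+ years': 2, 'All ages': 3}

# Hypothetical column; '0-14 years' is an invented label that age_map does not cover.
ages = pd.Series(['All ages', '0-14 years', '75+ years', '0-14 years'])

# Labels present in the data but missing from the map.
unmapped = sorted(set(ages.dropna()) - set(age_map))
encoded = ages.map(age_map)

print(unmapped)
print(encoded.dtype)
print(int(encoded.isna().sum()))
```

Running this kind of check on each column before calling `map` shows which labels need an entry in the dictionary.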


Code in progress...

Logistic Regression

In [125]:
Y_test = df_test["Severity"]
Y_train = df_train["Severity"]
X_train = df_train.drop("Severity", axis=1)
X_test  = df_test.drop("Severity", axis=1,errors='ignore').copy()
X_train.shape, Y_train.shape, X_test.shape

((1497, 5), (1497,), (801, 5))

In [126]:
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
print(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
print(acc_log)
acc_log = round(logreg.score(X_test, Y_test) * 100, 2)

acc_log

     Period  Units  Cause  Population  Age
0      2017      1    4.0         1.0  1.0
1      2018      1    4.0         1.0  1.0
2      2000      3    4.0         1.0  1.0
3      2001      3    4.0         1.0  1.0
4      2002      3    4.0         1.0  1.0
..      ...    ...    ...         ...  ...
796    2014      1    NaN         NaN  NaN
797    2015      1    NaN         NaN  NaN
798    2016      1    NaN         NaN  NaN
799    2017      1    NaN         NaN  NaN
800    2018      1    NaN         NaN  NaN

[801 rows x 5 columns]
42.89


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
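The warning above suggests two remedies: raising `max_iter` or scaling the features. Both can be combined in a pipeline, as in the sketch below. The data is synthetic (a year-like column next to small integer codes, with random labels), and `max_iter=1000` is an arbitrary illustrative value, not a tuned setting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's features; values and labels are random.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(2000, 2019, size=300),  # year-like column, large magnitude
    rng.integers(1, 6, size=300),        # small integer codes
    rng.integers(1, 4, size=300),
])
y = rng.integers(1, 4, size=300)         # three classes: 1, 2, 3

# Standardize the features, then fit with a higher iteration limit.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)
train_accuracy = model.score(X, y)
print(train_accuracy)
```

Scaling matters here because a column such as `Period` (values around 2000) sits on a very different scale from the encoded categories.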


ValueError: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
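The error message points to two ways forward: drop the samples with missing values, or impute them in a pipeline. Both are sketched below on made-up frames. The variable names reuse the notebook's, but every value is hypothetical, and `strategy='most_frequent'` is only one possible choice for integer category codes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical miniature versions of the notebook's frames.
X_train = pd.DataFrame({'Period': [2000, 2001, 2002, 2003], 'Cause': [1, 4, 1, 4]})
y_train = pd.Series([1, 3, 1, 3])
X_test = pd.DataFrame({'Period': [2017, 2018, 2016], 'Cause': [4.0, np.nan, 1.0]})
y_test = pd.Series([3, 1, 1])

# Option 1: drop test rows with any missing feature, keeping the labels aligned.
complete = X_test.notna().all(axis=1)
X_test_complete, y_test_complete = X_test[complete], y_test[complete]

# Option 2: fill missing values with the most frequent training value, then fit.
model = make_pipeline(SimpleImputer(strategy='most_frequent'),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions)
```

If the NaN values come from labels missing in the encoding dictionaries, extending those dictionaries is likely the more faithful fix, since imputation replaces unknown categories with guessed ones.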