# Döntési fa napi eladott darabszámra vetítve

### Feladat összefoglalása

Készítsetek egy döntési fa modellt a tanítóadaton, és adjatok le a predikcióitokat a test setre. A predikciók legyenek 2 oszloposak, 'index' illetve 'pred' névvel ellátva, értelemszerű tartalommal. Az indexben a test adatok indexe szerepeljen.

FELKERÜLTEK LEÍRÁSOK AZ ADATOKHOZ. TOP SECRET, NE KERÜLJÖN KI A KURZUSRÓL!

Használhatjátok a megismert feature engineering módszereket, vagy bármi mást.


A célváltozó: NAPI_ELADOTT_DB
Az indexnek pedig kérlek állítsatok elő egy oszlopot: LAPISSUGL-ARUSKOD formában.

## 1. Alapozó lépések

### 1.1. Könyvtárak behívása

In [1]:
import altair as alt
import graphviz
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.feature_extraction import FeatureHasher
from sklearn import tree
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from pandas_profiling import ProfileReport
from sklearn.model_selection import cross_val_score
from pathlib import Path  

### 1.2. Adatbázis behívása, ismerkedés az adat karakterisztikáival

Adatok behívása

In [2]:
df = pd.read_csv("data_train.csv")
df_Y = pd.read_csv("target_train.csv")
df = df.join(df_Y)

*LAPISSUGL-ARUSKOD* létrehozása

In [3]:
df['LAPISSUGL-ARUSKOD'] = df['LAPISSUGL'].astype(str) + '-' + df['ARUSKOD'].astype(str)
#df = df.rename(columns = {'index':'LAPISSUGL-ARUSKOD'})
df.head()


Unnamed: 0,LAPISSUGL,KFDELDATE,ARUSKOD,NAPOK_POLCON,ELOZO_NAPOK_POLCON,ELOZO_NAPI_ELADOTT_DB,ARUSMEGYE,ARUSTI1,ARUSTI2,ARUSTI3,...,LAPMELL4,LAPMELL5,LAPMELL6,LAPMELL7,LAPMELL8,LAPMELL9,LAPMELL10,LAPFORMAT,NAPI_ELADOTT_DB,LAPISSUGL-ARUSKOD
0,10708900020160010,20160923,415120,34,34,0.088235,26,322800,3228,32,...,,,,,,,,A4,0.088235,10708900020160010-415120
1,10414850020180337,20180406,823031,29,29,0.034483,36,110201,1102,11,...,,,,,,,,A4,0.034483,10414850020180337-823031
2,10708900020200009,20200826,937637,29,34,0.058824,17,220115,2201,22,...,,,,,,,,A4,0.103448,10708900020200009-937637
3,10534000020190003,20190302,823031,28,34,0.0,36,110201,1102,11,...,,,,,,,,A4,0.0,10534000020190003-823031
4,10554880020170006,20170209,108679,8,8,0.5,6,120400,1204,12,...,,,,,,,,A4,0.0,10554880020170006-108679


Pandas profilinggal lecsekkoltam az adatok eloszlását és alapvető karakterisztikáit adattisztítás és későbbi feature engeneering miatt. (A profile kód kikommentelve a gyorsabb futás kedvéért.)

In [4]:
prof = ProfileReport(df)
#prof.to_file(output_file='C:/Rajk - újgép/Machine Learning - újgép/decision_tree/output.html') 

### 1.3. Feature engineering

A pandas profile (pp) alapján első körben kidobtam a 0 varianciával rendelkező változókat

In [5]:
for col in df.columns:
    if len(df[col].unique()) == 1:
        df.drop(col,inplace=True,axis=1)
prof = ProfileReport(df)
#prof.to_file(output_file='C:/Rajk - újgép/Machine Learning - újgép/decision_tree/output_2.html')

Ezután kidobom a nagyon magas korrelációval rendelkező változókat (meghagyva mindehol egyet). Ezt is a pp alapján tettem, a correlációs mátrixot a kódban is megjelenítve.

In [6]:
corr = df.corr(method = 'spearman')
def display_correlation(df):
    r = df.corr(method="spearman")
    plt.figure(figsize=(10,6))
    heatmap = sns.heatmap(df.corr(), vmin=-1, 
                      vmax=1)
    plt.title("Spearman Correlation")
    return(r)

r_simple=display_correlation(df)
r_simple

Unnamed: 0,LAPISSUGL,KFDELDATE,ARUSKOD,NAPOK_POLCON,ELOZO_NAPOK_POLCON,ELOZO_NAPI_ELADOTT_DB,ARUSMEGYE,ARUSTI1,ARUSTI2,ARUSTI3,...,sportletesitmeny,LAPMEDTER,LAPTERMCS,LAPMEGJSZ,LAPKIADO,LAPARBRUT,LAPARNET,LAPCS1,LAPCS2,NAPI_ELADOTT_DB
LAPISSUGL,1.0,0.276925,0.027041,-0.369622,-0.376898,0.261639,-0.046289,0.070634,0.069901,0.074951,...,0.048199,0.98979,0.98979,0.276882,-0.18782,-0.106222,-0.106222,-0.480762,-0.447451,0.253113
KFDELDATE,0.276925,1.0,0.00788,0.02023,3.7e-05,-0.013305,-0.047959,0.00512,-0.001412,-0.001223,...,-0.022044,0.153012,0.153012,0.989308,0.130028,0.283815,0.283815,-0.034741,-0.035772,-0.119011
ARUSKOD,0.027041,0.00788,1.0,0.07267,0.077059,-0.130769,0.396054,-0.457172,-0.451558,-0.448456,...,0.072035,0.027124,0.027124,0.004325,-0.044044,0.036656,0.036656,-0.088241,-0.085265,-0.123692
NAPOK_POLCON,-0.369622,0.02023,0.07267,1.0,0.860477,-0.345686,-0.034166,-0.014571,0.007283,0.010869,...,0.047126,-0.374014,-0.374014,-0.009459,-0.319681,0.481827,0.481827,-0.326373,-0.356054,-0.348154
ELOZO_NAPOK_POLCON,-0.376898,3.7e-05,0.077059,0.860477,1.0,-0.356633,-0.032431,-0.018064,0.004209,0.007891,...,0.048541,-0.379674,-0.379674,-0.023819,-0.31547,0.47649,0.47649,-0.32168,-0.350968,-0.332289
ELOZO_NAPI_ELADOTT_DB,0.261639,-0.013305,-0.130769,-0.345686,-0.356633,1.0,0.03883,0.095844,0.051146,0.047288,...,0.215094,0.267805,0.267805,-0.004496,0.0653,-0.211881,-0.211881,-0.01013,0.012665,0.640523
ARUSMEGYE,-0.046289,-0.047959,0.396054,-0.034166,-0.032431,0.03883,1.0,0.086765,0.071418,0.034137,...,-0.021559,-0.041828,-0.041828,-0.040439,0.07163,0.033577,0.033577,0.093367,0.092099,0.038345
ARUSTI1,0.070634,0.00512,-0.457172,-0.014571,-0.018064,0.095844,0.086765,1.0,0.964131,0.962064,...,0.200479,0.072318,0.072318,0.004371,0.002616,0.021445,0.021445,-0.030105,-0.027512,0.100023
ARUSTI2,0.069901,-0.001412,-0.451558,0.007283,0.004209,0.051146,0.071418,0.964131,1.0,0.997856,...,0.207938,0.072328,0.072328,-0.003669,-0.012706,0.025166,0.025166,-0.050912,-0.048229,0.054423
ARUSTI3,0.074951,-0.001223,-0.448456,0.010869,0.007891,0.047288,0.034137,0.962064,0.997856,1.0,...,0.208385,0.077445,0.077445,-0.00454,-0.021491,0.019911,0.019911,-0.062063,-0.05923,0.050935


Ezek alapján a kidobott változók: *ARUSTI2, ARUSTI3, ARUSTI4, LAPMEDTER, LAPTERMCS, LAPARNET, LAPCS2*.

Újabb pp futtatás a validációhoz.

In [7]:
df = df.drop(["ARUSTI2", "ARUSTI3", "ARUSTI4", "LAPMEDTER", "LAPTERMCS", "LAPARNET", "LAPCS2"], axis = 1)
prof = ProfileReport(df)
#prof.to_file(output_file='C:/Rajk - újgép/Machine Learning - újgép/decision_tree/output_4.html')

#### Encodeolások előkészítése

Az elemzésben végül csak Hash encode-ot használtam, így a többi encode előkészítése kitörlésre került.

#### Hash encode-olás előskészítése

In [8]:
fh_n_features = 1000

x_colnames_fh = [f'fh{n}' for n in range(fh_n_features)]
fh = FeatureHasher(n_features=fh_n_features, input_type='string', dtype=np.float32)

Maga az encode

In [9]:
x_cols_names = ["NAPOK_POLCON","ELOZO_NAPOK_POLCON","ELOZO_NAPI_ELADOTT_DB","ARUSMEGYE","ARUSTI1", "bevasarlokozpont","irodahazhivatal", "parkjatszoter","piac", "LAPKIADO"]
fh_vec = fh.fit_transform(X=df['CNETWORK'])
fh_vec = pd.DataFrame(fh_vec.todense(), columns=x_colnames_fh, index=df.index)

fh_vec_test = fh.fit_transform(X=df['CNETWORK'])
fh_vec_test = pd.DataFrame(fh_vec_test.todense(), columns=x_colnames_fh, index=df.index)

df[x_colnames_fh] = fh_vec
df[x_colnames_fh] = fh_vec_test
x_fin_colnames = x_cols_names + x_colnames_fh

  self[k1] = value[k2]


### 1.4. Később felhasználandó függvények definiálása

In [10]:
df_x = df.drop(["NAPI_ELADOTT_DB"], axis = 1)

dataset_size = df.shape[0]
max_depths = list(range(1, 30))
params = {"max_depth": 5}
y_colname = "NAPI_ELADOTT_DB"
init_x_colname = df_x.columns

In [11]:
def train_test_split_func(y_colname=y_colname, df=df):
    retlist = train_test_split(
        df.drop(y_colname, axis=1).values,
        df[y_colname].values,
        test_size=0.2,
        random_state=42,
    )

    return [
        pd.DataFrame(f, columns=[f for f in df.columns if not f == y_colname])
        if i < 2
        else pd.DataFrame(f, columns=[y_colname])
        for i, f in enumerate(retlist) ]


X_train, X_test, y_train, y_test = train_test_split_func(y_colname=y_colname, df=df)

In [12]:
def dec_tree_rmse(
    params,
    x_colnames,
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test,
):
    """
    Fit decision tree with params, and returns rmse and plot if plot.
    """
    dec_tree = DecisionTreeRegressor(**params).fit(X_train.loc[:, x_colnames], y_train)
    test_preds = dec_tree.predict(X_test.loc[:, x_colnames])
    rmse = mean_squared_error(y_true=y_test, y_pred=test_preds, squared=False)

    #source = pd.DataFrame(
        #{"y": y_test["y"], "y_pred": test_preds, "cat": X_test["cat"]}
    #)

    return rmse, dec_tree

def linreg_rmse(
    x_colnames,
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test,
):
    """
    Fit decision tree with params, and returns rmse and plot if plot.
    """
    linreg = LinearRegression().fit(X_train.loc[:, x_colnames], y_train[y_colname])
    test_preds = linreg.predict(X_test.loc[:, x_colnames])
    rmse = mean_squared_error(y_true=y_test, y_pred=test_preds, squared=False)

    source = pd.DataFrame(
        {"y": y_test["y"], "y_pred": test_preds, "cat": X_test["cat"]}
    )

    return rmse, source, linreg


def create_scatter(source):
    """
    Creates a scatterplot of true values and predictions colored by `cat` column.
    Colnames have to be `y` and `y_pred` respectively.
    """
    maxi_val = source[['y', 'y_pred']].max().max() + 2
    mini_val = source[['y', 'y_pred']].min().min() - 2
    return alt.Chart(source).mark_circle(
        size=60).encode(
        x=alt.X('y:Q', scale=alt.Scale(domain=(mini_val, maxi_val))),
        y=alt.Y("y_pred:Q", scale=alt.Scale(domain=(mini_val, maxi_val))), 
        color="cat", 
        tooltip="cat").properties(
        width=800, height=500).interactive()


def plot_max_depth_rmse(list_of_depths, list_of_rmse):
    """
    Creates a lineplot from 2 list of numeric values.
    """
    source = pd.DataFrame(index=range(len(list_of_rmse)), columns=["max_depth", "rmse"])
    source["rmse"] = list_of_rmse
    source["max_depth"] = list_of_depths
    return alt.Chart(source).mark_line().encode(x="max_depth", y="rmse").properties(
        width=800, height=500).interactive()


# https://mljar.com/blog/visualize-decision-tree/
def visualize_dec_tree(decision_tree):
    """
    Returns visualization of a decision tree
    """
    dot_data = tree.export_graphviz(
        decision_tree, out_file=None, feature_names=decision_tree.feature_names_in_, filled=True
    )
    return graphviz.Source(dot_data, format="png")

def create_encoded_column(encoder, colname_s, X_train=X_train, y_train=y_train):
    """
    Use encoder to fit_transform the `cat` column and insert the encoded column values
    to `x_colname` col in both train and test sets. Only fits on train, transforms test.
    """
    # you fit on training data
    X_train[colname_s] = encoder.fit_transform(X_train["cat"].values.reshape(-1, 1))

    # but only TRANSFORM the test data
    X_test[colname_s] = encoder.transform(X_test["cat"].values.reshape(-1, 1))
    
    return X_train, X_test

## 2. A döntési fa megégépítése

### 2.1. Felhasznált magyarázó változók kiválasztása

A preprocess lépései alapján kiválasztott változók a következőek:

In [13]:
x_cols_names = ["NAPOK_POLCON","ELOZO_NAPOK_POLCON","ELOZO_NAPI_ELADOTT_DB","ARUSMEGYE","ARUSTI1","CNETWORK", "bevasarlokozpont","irodahazhivatal", "parkjatszoter","piac", "LAPMEGJSZ", "LAPKIADO"]

Ezek közül a CNETWORK változót kell encodeolni. Mivel string, az óraiak alapján Hash Encode-olást végeztem el rajta. (Mivel ez a lépés egy preprocess lépés, átemeltem a kód elejére.)

### 2.2. A döntési fa megépítése

Ehhez először megalkotom a dataframe-et, amin tanítom

In [14]:
X_train = X_train[X_train.columns.intersection(x_fin_colnames)]

Lecsekkolom, hogy minden változó használható formában van-e

In [15]:
X_train.dtypes.value_counts()
y_train.dtypes.value_counts()
X_test.dtypes.value_counts()

object    1032
dtype: int64

Az X_train-re és az X_testre is kezelni kell az object alapú prediktorokat

In [16]:
X_train.loc[:,X_train.dtypes.loc[lambda x: x == "object"].index] = X_train.loc[:,X_train.dtypes.loc[lambda x: x == "object"].index].astype(int)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value, self.name)


In [17]:
X_test = X_test[X_test.columns.intersection(x_fin_colnames)]

In [18]:
X_test.loc[:,X_test.dtypes.loc[lambda x: x == "object"].index] = X_test.loc[:,X_test.dtypes.loc[lambda x: x == "object"].index].astype(int)

Végső ellenőrzés

In [19]:
X_train.dtypes.value_counts()
X_test.dtypes.value_counts()

object    1010
dtype: int64

Minden rendben van

### 2.3. RMSE és vágás lecsekkolása

In [20]:
rmses_oe = [dec_tree_rmse(params={"max_depth": i}, x_colnames=x_fin_colnames)[0] for i in max_depths]
plot_max_depth_rmse(max_depths, rmses_oe)

Ezek alapján 8-as vágás éri meg a legjobban.

### 2.4. Cross validáció

In [21]:
features = x_fin_colnames

x = df[features]
y = df['NAPI_ELADOTT_DB']

depth = []
for i in range(3,20):
    clf = tree.DecisionTreeRegressor(max_depth=i)
    # Perform 7-fold cross validation 
    scores = cross_val_score(estimator=clf, X=x, y=y, cv=7, n_jobs=4)
    depth.append((i,scores.mean()))
print(depth)

[(3, 0.5642705148938145), (4, 0.5871901615467718), (5, 0.5796969621479457), (6, 0.5860202987003864), (7, 0.6051816481819979), (8, 0.5910005374405697), (9, 0.6085377342462214), (10, 0.5981943579495302), (11, 0.5901631508136879), (12, 0.5742935818790398), (13, 0.5732467258269637), (14, 0.5823114202176252), (15, 0.5672669989837312), (16, 0.5733172723086634), (17, 0.5744219577374423), (18, 0.5708759609263023), (19, 0.5702070448029916)]


### 2.5. A fa megépítése

In [22]:
clf = DecisionTreeRegressor()

# Train Decision Tree Classifer
clf = clf.fit(X_train,y_train)

A fa kiplottolása

## 3. Előrejelzés

### 3.1. Előrejelzés először a train adaton

In [23]:
#Predict the response for test dataset
y_pred = clf.predict(X_test)

Működik

### 3.2. Előrejelzés elkészítése a teszt adatra

Teszt adat behívása

In [31]:
df_test = pd.read_csv("data_test.csv")
df_test['LAPISSUGL-ARUSKOD'] = df_test['LAPISSUGL'].astype(str) + '-' + df_test['ARUSKOD'].astype(str)

Segéd tábla kimentve index változóhoz

Teszt adat használható formára hozása

In [25]:
df_test.dtypes.value_counts()

for col in df_test.columns:
    if len(df_test[col].unique()) == 1:
        df_test.drop(col,inplace=True,axis=1)

df_test = df_test.drop(["ARUSTI2", "ARUSTI3", "ARUSTI4", "LAPMEDTER", "LAPTERMCS", "LAPARNET", "LAPCS2"], axis = 1)

x_cols_names = ["NAPOK_POLCON","ELOZO_NAPOK_POLCON","ELOZO_NAPI_ELADOTT_DB","ARUSMEGYE","ARUSTI1", "bevasarlokozpont","irodahazhivatal", "parkjatszoter","piac", "LAPKIADO"]

fh_vec_test = fh.fit_transform(X=df_test['CNETWORK'])
fh_vec_test = pd.DataFrame(fh_vec_test.todense(), columns=x_colnames_fh, index=df_test.index)

df_test[x_colnames_fh] = fh_vec
df_test[x_colnames_fh] = fh_vec_test
x_fin_colnames = x_cols_names + x_colnames_fh

df_test = df_test[df_test.columns.intersection(x_fin_colnames)]

  self[k1] = value[k2]


Végső predikció

In [26]:
#Predict the response for test dataset
fin_pred = clf.predict(df_test)
fin_pred

array([0.02222222, 0.0950495 , 0.00555556, ..., 1.1968254 , 0.08125   ,
       0.03448275])

### 3.3. Házi formátum létrehozása

In [27]:
fin_pred

array([0.02222222, 0.0950495 , 0.00555556, ..., 1.1968254 , 0.08125   ,
       0.03448275])

## TEST METRIKA

In [35]:
hazi_data = pd.DataFrame({'index':df_test["LAPISSUGL-ARUSKOD"],'pred':fin_pred}).set_index('index')

In [36]:
target_test = pd.read_csv("target_test.csv").drop('Unnamed: 0', axis=1).set_index('index')

In [38]:
# ez is korrekt, keverve lett leadva...
mean_squared_error(target_test.join(hazi_data)['y_true'], target_test.join(hazi_data)['pred'], squared=False)

0.14812605956511926

A predikció elmentése a házi_data df-be

In [127]:
hazi_data = pd.DataFrame({'index':df_test_index["LAPISSUGL-ARUSKOD"],'pred':fin_pred})
filepath = Path('C:/Rajk - újgép/Machine Learning - újgép/decision_tree/prediction_KD.csv')  
filepath.parent.mkdir(parents=True, exist_ok=True)  
hazi_data.to_csv(filepath)  