# EDA: Attractiveness vs Decision (Speed Dating)

Goal: show how the `attractive_partner` rating relates to `decision` (whether the person wants to meet again).

File: `speeddating.csv`
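
As a minimal sketch of the comparison this notebook is after, the snippet below groups ratings by `decision` and compares the group means. The rows are invented illustration values, not taken from `speeddating.csv`, and the helper name `mean_rating_by_decision` is hypothetical. It uses only the standard library so it runs without the dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (attractive_partner, decision) pairs, invented for illustration only.
rows = [(8, 1), (7, 1), (9, 1), (4, 0), (6, 0), (5, 0), (6, 1), (3, 0)]


def mean_rating_by_decision(pairs):
    """Return {decision: mean rating} for an iterable of (rating, decision) pairs."""
    groups = defaultdict(list)
    for rating, decision in pairs:
        groups[decision].append(rating)
    return {decision: mean(ratings) for decision, ratings in groups.items()}


summary = mean_rating_by_decision(rows)
print(summary)
```

On the notebook's cleaned subset, the same summary should be available through `sub.groupby('decision')['attractive_partner_numeric'].mean()`.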

In [22]:
# Import libraries
{
    "cells": [
        {
            "cell_type": "markdown",
            "id": "#VSC-a8158a74",
            "metadata": {
                "language": "markdown"
            },
            "source": [
                "# EDA: Attractiveness vs Decision (Speed Dating)",
                "",
                "Goal: show how the `attractive_partner` rating relates to `decision` (whether the person wants to meet again).",
                "",
                "File: `speeddating.csv`"
            ]
        },
        {
            "cell_type": "code",
            "id": "#VSC-fa212b07",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# Import libraries",
                "import pandas as pd",
                "import numpy as np",
                "import seaborn as sns",
                "import matplotlib.pyplot as plt",
                "%matplotlib inline",
                "",
                "# Check whether scikit-learn is installed; if not, set a flag and print the install command",
                "try:",
                "    import sklearn",
                "    SKLEARN_AVAILABLE = True",
                "except Exception:",
                "    SKLEARN_AVAILABLE = False",
                "    print('scikit-learn (sklearn) is not installed in this environment.')",
                "    print('To install it, run in PowerShell:')",
                "    print('python -m pip install -U scikit-learn')",
                "    print('or: pip install -U scikit-learn')",
                ""
            ]
        },
        {
            "cell_type": "code",
            "id": "#VSC-ce9a4fdb",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# Load the data",
                "p = r\"c:\\Users\\Root\\Documents\\Python\\kurz_czechitas\\lekce 6\\speeddating.csv\"",
                "df = pd.read_csv(p, encoding='utf-8', low_memory=False)",
                "print('Rows, cols:', df.shape)",
                "# Show the first few rows and check the dtypes",
                "display(df.head())",
                "print('\\nDtypes:')",
                "print(df.dtypes.value_counts())",
                "print('\\nColumns sample:', list(df.columns)[:40])"
            ]
        },
        {
            "cell_type": "markdown",
            "id": "#VSC-e45e4f01",
            "metadata": {
                "language": "markdown"
            },
            "source": [
                "## Prompt for an AI tool (paste into the AI tool)",
                "",
                "Table name: `speeddating`",
                "",
                "Columns and their meanings:",
                "- `attractive_partner`: rating of the partner's attractiveness (numeric)",
                "- `decision`: binary decision on whether the person wants to meet again (1 = yes, 0 = no)",
                "",
                "Request: Please write Python code using the `pandas` and `seaborn` libraries that visualizes the relationship between `attractive_partner` and `decision`. Suggest a suitable visualization (e.g. the distribution of `attractive_partner` by `decision`, box plots, a violin plot, or a logistic probability curve).",
                "",
                "Please list several visualization options and pick one of them (explain why). Then generate code that draws the chosen visualization."
            ]
        },
        {
            "cell_type": "code",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# Basic cleaning and statistics for 'attractive_partner' and 'decision'",
                "df['attractive_partner_numeric'] = pd.to_numeric(df['attractive_partner'], errors='coerce')",
                "print('original non-null (raw):', df['attractive_partner'].notna().sum(), '-> numeric non-null:', df['attractive_partner_numeric'].notna().sum())",
                "print('\\nDecision value counts:')",
                "print(df['decision'].value_counts(dropna=False))",
                "print('\\nBasic stats (numeric attractive_partner):')",
                "print(df['attractive_partner_numeric'].describe())",
                "",
                "# Keep a cleaner subset for further analysis",
                "sub = df[['attractive_partner_numeric','decision']].copy()",
                "sub = sub.dropna()",
                "print('\\nSubset for analysis shape:', sub.shape)"
            ]
        },
        {
            "cell_type": "code",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# Plot the rating distribution (KDE) by decision",
                "plt.figure(figsize=(10,5))",
                "if sub['attractive_partner_numeric'].empty:",
                "    print('No numeric attractive_partner values available; showing countplot of original values')",
                "    sns.countplot(x='attractive_partner', hue='decision', data=df)",
                "    plt.xlabel('Attractive partner (categorical)')",
                "else:",
                "    sns.kdeplot(sub.loc[sub['decision']==1, 'attractive_partner_numeric'], label='decision=1', fill=True)",
                "    sns.kdeplot(sub.loc[sub['decision']==0, 'attractive_partner_numeric'], label='decision=0', fill=True)",
                "    plt.xlabel('Attractive partner (numeric)')",
                "    plt.title('Distribution of attractiveness ratings by decision')",
                "plt.legend()",
                "plt.tight_layout()",
                "plt.savefig('attractive_partner_kde_by_decision.png')",
                "plt.show()"
            ]
        },
        {
            "cell_type": "code",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# Box plot to compare medians and spread",
                "plt.figure(figsize=(8,6))",
                "sns.boxplot(x='decision', y='attractive_partner_numeric', data=sub)",
                "plt.xlabel('Decision')",
                "plt.ylabel('Attractive partner (numeric)')",
                "plt.title('Boxplot: attractive_partner vs decision')",
                "plt.tight_layout()",
                "plt.savefig('attractive_partner_boxplot.png')",
                "plt.show()"
            ]
        },
        {
            "cell_type": "code",
            "metadata": {
                "language": "python"
            },
            "source": [
                "# Try logistic regression (if sklearn is available), otherwise fall back to binned probabilities",
                "auc = None",
                "if not SKLEARN_AVAILABLE:",
                "    print('scikit-learn is not available: plotting binned probabilities as a fallback')",
                "    bins = pd.cut(sub['attractive_partner_numeric'], bins=10)",
                "    prob = sub.assign(decision=pd.to_numeric(sub['decision'], errors='coerce')).groupby(bins)['decision'].mean()",
                "    count = sub.groupby(bins)['decision'].count()",
                "    plt.figure(figsize=(8,5))",
                "    prob.plot(marker='o')",
                "    plt.xlabel('Attractive partner (binned)')",
                "    plt.ylabel('P(decision=1)')",
                "    plt.title('Binned estimate of P(decision=1) by attractiveness (fallback)')",
                "    plt.tight_layout()",
                "    plt.savefig('attractive_partner_binned_prob.png')",
                "    plt.show()",
                "else:",
                "    from sklearn.linear_model import LogisticRegression",
                "    from sklearn.model_selection import train_test_split",
                "    from sklearn.metrics import roc_auc_score",
                "    X = sub[['attractive_partner_numeric']].rename(columns={'attractive_partner_numeric':'attractive_partner'})",
                "    y = pd.to_numeric(sub['decision'], errors='coerce')",
                "    mask = X['attractive_partner'].notna() & y.notna()",
                "    X = X.loc[mask]",
                "    y = y.loc[mask]",
                "    if X['attractive_partner'].nunique() < 2 or len(X) < 10:",
                "        print('Not enough variation or too few rows; skipping logistic fit')",
                "    else:",
                "        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)",
                "        model = LogisticRegression(solver='liblinear')",
                "        model.fit(X_train, y_train)",
                "        probs = model.predict_proba(X_test)[:,1]",
                "        auc = roc_auc_score(y_test, probs)",
                "        print(f'Logistic regression AUC = {auc:.3f}')",
                "        # plot logistic curve over range",
                "        vals = np.linspace(X['attractive_partner'].min(), X['attractive_partner'].max(), 200)",
                "        preds = model.predict_proba(pd.DataFrame({'attractive_partner': vals}))[:,1]",
                "        plt.figure(figsize=(8,5))",
                "        sns.lineplot(x=vals, y=preds)",
                "        plt.xlabel('Attractive partner')",
                "        plt.ylabel('P(decision=1)')",
                "        plt.title(f'Logistic fit (AUC={auc:.2f})')",
                "        plt.tight_layout()",
                "        plt.savefig('attractive_partner_logistic_fit.png')",
                "        plt.show()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "language": "markdown"
            },
            "source": [
                "## Summary",
                "In this notebook we loaded the data, converted the `attractive_partner` column to numeric form, and compared the rating distributions by `decision`. If `scikit-learn` is not available, we plot a binned (grouped) estimate of the decision probability; if it is available and there is enough data, we fit a simple logistic regression and plot the probability curve.\n",
                "\n",
                "Saved figures: `attractive_partner_kde_by_decision.png`, `attractive_partner_boxplot.png`, and either `attractive_partner_binned_prob.png` or `attractive_partner_logistic_fit.png`."
            ]
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python 3",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "name": "python",
            "version": "3.10"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 5
}
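
The fallback branch in the notebook above estimates P(decision=1) within rating bins. A dependency-free sketch of that idea follows, on invented data and with a hypothetical helper `binned_yes_rate`. It assumes equal-width bins, similar in spirit to `pd.cut`, although the edge handling is not identical.

```python
def binned_yes_rate(pairs, n_bins=5):
    """Share of decision == 1 per equal-width rating bin.

    Returns a list of (bin_low, bin_high, rate_or_None); empty bins get None.
    """
    ratings = [r for r, _ in pairs]
    low, high = min(ratings), max(ratings)
    width = (high - low) / n_bins
    if width == 0:
        raise ValueError("ratings must contain at least two distinct values")
    yes = [0] * n_bins
    total = [0] * n_bins
    for rating, decision in pairs:
        # Clamp the index so the maximum rating falls into the last bin.
        idx = min(int((rating - low) / width), n_bins - 1)
        total[idx] += 1
        yes[idx] += decision
    return [
        (low + i * width, low + (i + 1) * width, yes[i] / total[i] if total[i] else None)
        for i in range(n_bins)
    ]


# Hypothetical (attractive_partner, decision) pairs, invented for illustration only.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (5, 1), (6, 0), (7, 1), (8, 1), (9, 1), (10, 1)]
rates = binned_yes_rate(data, n_bins=3)
for bin_low, bin_high, rate in rates:
    print(f"{bin_low:.1f} to {bin_high:.1f}: {rate}")
```

A rate that rises from the lowest bin to the highest is the pattern the logistic curve would smooth into a single S-shaped line.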

{'cells': [{'cell_type': 'markdown',
   'id': '#VSC-a8158a74',
   'metadata': {'language': 'markdown'},
   'source': ['# EDA: Attractiveness vs Decision (Speed Dating)',
    '',
    'Goal: show how the `attractive_partner` rating relates to `decision` (whether the person wants to meet again).',
    '',
    'File: `speeddating.csv`']},
  {'cell_type': 'code',
   'id': '#VSC-fa212b07',
   'metadata': {'language': 'python'},
   'source': ['# Import libraries',
    'import pandas as pd',
    'import numpy as np',
    'import seaborn as sns',
    'import matplotlib.pyplot as plt',
    '%matplotlib inline',
    '',
    '# Check whether scikit-learn is installed; if not, set a flag and print the install command',
    'try:',
    '    import sklearn',
    '    SKLEARN_AVAILABLE = True',
    'except Exception:',
    '    SKLEARN_AVAILABLE = False',
    "    print('scikit-learn (sklearn) is not installed in this environment.')",
    "    print('Pro nainstalovani spusťte v PowerS