# Chapter 24: Linear Regression - Prediction from Data

## How AI predicts continuous values

Imagine you are a teacher trying to estimate what score a student who studied for 10 hours will get. You have data from other students: how long each one studied and what score they received.

**Can you spot the trend?** That is exactly what linear regression does!

---

### What we will learn:
1. **Linear regression** - what it is and how it works
2. **Equation of a line** - y = β₁·x + β₀
3. **Method of least squares** - how to find the best line
4. **Quality metrics** - MSE, RMSE, MAE, R² score
5. **Multiple regression** - several input variables
6. **Practical examples** - predicting prices, exam scores, temperature

### Why is linear regression important?
- A basic building block of ML
- Easy to interpret
- Used in economics, science, business...

## 1. Installing and importing libraries

In [None]:
# Install the libraries (for Google Colab)
!pip install pandas numpy matplotlib seaborn scikit-learn -q

print("Libraries installed!")

In [None]:
# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import PolynomialFeatures

# Plot settings
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
np.random.seed(42)

print("All libraries loaded successfully!")

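## 2. What is linear regression?

**Linear regression** looks for the line that best fits our data. "Best" here means the line that minimises the sum of squared vertical distances between the data points and the line (the method of least squares).

### Equation of a line: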

```
y = β₁·x + β₀
```

Where:
- **y** = the value we want to predict (e.g. the exam score)
- **x** = the value we know (e.g. hours of study)
- **β₁** = slope of the line - "how much y changes when x increases by 1"
- **β₀** = intercept with the y-axis - "the value of y when x = 0"
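
For a single input, the least-squares slope and intercept have a closed-form solution. The sketch below is a minimal pure-Python illustration, assuming the standard formulas β₁ = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and β₀ = ȳ - β₁·x̄. The helper name `fit_line` and the sample points are hypothetical and are not part of this chapter's datasets.

```python
def fit_line(xs, ys):
    """Return (beta1, beta0) of the least-squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sums of cross-deviations and squared deviations (the 1/n factors cancel)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    beta1 = s_xy / s_xx
    beta0 = y_mean - beta1 * x_mean
    return beta1, beta0


# Hypothetical noise-free points lying exactly on y = 2x + 3
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
beta1, beta0 = fit_line(xs, ys)
print(f"y = {beta1:.2f}x + {beta0:.2f}")
```

Because the sample points lie exactly on a line, the fit recovers that line; with noisy data the same formulas return the line with the smallest sum of squared residuals.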

In [None]:
# Visualising the concept of linear regression
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Data
np.random.seed(42)
x_demo = np.linspace(0, 10, 20)
y_demo = 2 * x_demo + 3 + np.random.normal(0, 2, 20)

# Least-squares fit (np.polyfit returns the slope first, then the intercept)
b1_demo, b0_demo = np.polyfit(x_demo, y_demo, 1)
y_pred_demo = b1_demo * x_demo + b0_demo

# Plot 1: data only
axes[0].scatter(x_demo, y_demo, c='blue', s=80, alpha=0.7)
axes[0].set_xlabel('X (input)')
axes[0].set_ylabel('Y (output)')
axes[0].set_title('1. We have data')

# Plot 2: several candidate lines
axes[1].scatter(x_demo, y_demo, c='blue', s=80, alpha=0.7)
axes[1].plot(x_demo, 1.5*x_demo + 5, 'r--', alpha=0.5, label='Line 1')
axes[1].plot(x_demo, 2.5*x_demo + 1, 'g--', alpha=0.5, label='Line 2')
axes[1].plot(x_demo, y_pred_demo, 'orange', linewidth=2, label='Best line')
axes[1].set_xlabel('X (input)')
axes[1].set_ylabel('Y (output)')
axes[1].set_title('2. We look for the best line')
axes[1].legend()

# Plot 3: result
axes[2].scatter(x_demo, y_demo, c='blue', s=80, alpha=0.7)
axes[2].plot(x_demo, y_pred_demo, 'orange', linewidth=2,
             label=f'y = {b1_demo:.2f}x + {b0_demo:.2f}')
# Residuals: vertical distance from each point to the line
for i in range(len(x_demo)):
    axes[2].plot([x_demo[i], x_demo[i]], [y_demo[i], y_pred_demo[i]], 'r-', alpha=0.3)
axes[2].set_xlabel('X (input)')
axes[2].set_ylabel('Y (output)')
axes[2].set_title('3. We minimise the errors (red lines)')
axes[2].legend()

plt.suptitle('The concept of linear regression', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 3. Practical example: Predicting study results

We have data about students:
- **Input (X):** Number of hours of study
- **Output (Y):** Exam score (points)

In [None]:
# Our data
hodiny_studia = np.array([2.5, 5.1, 3.2, 8.5, 6.5, 9.2, 5.5, 8.3, 2.7, 7.7,
                          5.9, 6.1, 4.5, 3.3, 1.1, 8.9, 2.5, 1.9, 6.1, 7.4])
ziskana_znamka = np.array([21, 47, 27, 75, 62, 88, 60, 81, 25, 85,
                           62, 67, 41, 30, 17, 95, 30, 24, 67, 69])

# Build a DataFrame (hodiny_studia = hours of study, znamka = exam score)
studenti = pd.DataFrame({
    'hodiny_studia': hodiny_studia,
    'znamka': ziskana_znamka
})

print("=" * 50)
print("STUDENT DATA")
print("=" * 50)
print(studenti.head(10))
print(f"\nTotal number of students: {len(studenti)}")
print("\nBasic statistics:")
print(studenti.describe().round(2))

In [None]:
# Visualising the data
plt.figure(figsize=(10, 6))
plt.scatter(studenti['hodiny_studia'], studenti['znamka'],
            c='coral', s=100, alpha=0.7, edgecolors='black')
plt.xlabel('Hours of study')
plt.ylabel('Exam score (points)')
plt.title('Relationship between hours of study and exam score')
plt.grid(True, alpha=0.3)

# Add an annotation
plt.annotate('A clear trend!\nMore study = higher score',
             xy=(2, 80), fontsize=12, color='green')

plt.show()

print("\nThe plot shows a roughly linear relationship between study time and score!")

In [None]:
# Training the linear regression model
print("=" * 50)
print("TRAINING THE LINEAR REGRESSION MODEL")
print("=" * 50)

# Prepare the data
X = studenti[['hodiny_studia']]  # 2D input (one column) as sklearn expects
y = studenti['znamka']

# Create and train the model
model = LinearRegression()
model.fit(X, y)

print("\nModel trained successfully!")

# Model parameters
smernice = model.coef_[0]      # slope (β₁)
prusecik = model.intercept_    # intercept (β₀)

print("\nEquation of the line:")
print(f"  znamka = {smernice:.2f} * hodiny_studia + {prusecik:.2f}")

print("\nInterpretation:")
print(f"  - Each additional hour of study = +{smernice:.2f} points on the exam")
print(f"  - A student who does not study at all would get about {prusecik:.0f} points")
print("    (an extrapolation: no student in the data studied less than 1.1 hours)")

In [None]:
# Visualising the model
plt.figure(figsize=(10, 6))

# Data
plt.scatter(X, y, c='coral', s=100, alpha=0.7, edgecolors='black', label='Actual data')

# Regression line (predict on a DataFrame so the feature name matches the training data)
x_line = np.linspace(0, 10, 100).reshape(-1, 1)
y_line = model.predict(pd.DataFrame(x_line, columns=['hodiny_studia']))
plt.plot(x_line, y_line, 'b-', linewidth=2, label=f'Regression line: y = {smernice:.2f}x + {prusecik:.2f}')

# Prediction for a new student
# Note: 10 hours is slightly beyond the observed range (max 9.2 h), so this is an extrapolation
novy_student = 10  # 10 hours of study
predikce = model.predict(pd.DataFrame({'hodiny_studia': [novy_student]}))[0]
plt.scatter([novy_student], [predikce], c='green', s=200, marker='*',
            zorder=5, label=f'Prediction: {novy_student}h -> {predikce:.0f} points')

plt.xlabel('Hours of study')
plt.ylabel('Exam score (points)')
plt.title('Linear regression: Predicting study results')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"\nPrediction: a student who studies for {novy_student} hours will get about {predikce:.0f} points.")

## 4. Model quality metrics

How do we find out how good our model is?

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **MSE** | mean(error²) | Mean squared error - smaller is better |
| **RMSE** | √MSE | Square root of MSE - in the same units as Y |
| **MAE** | mean(\|error\|) | Mean absolute error - in the same units as Y |
| **R²** | 1 - (SS_res / SS_tot) | Share of the variability in Y that the model explains (at most 1; can be negative on unseen data) |
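
To make the formulas in the table concrete, here is a minimal pure-Python sketch that computes all four metrics by hand. The function name `regression_metrics` and the toy values are hypothetical, chosen only for illustration; in the notebook itself the scikit-learn functions do this work.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R² from two equally long lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    # R² compares the model's squared error with that of always predicting the mean
    y_mean = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}


# Hypothetical actual and predicted values
y_true_toy = [10.0, 20.0, 30.0, 40.0]
y_pred_toy = [12.0, 18.0, 33.0, 39.0]
toy_metrics = regression_metrics(y_true_toy, y_pred_toy)
for name, value in toy_metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that RMSE is never smaller than MAE, because squaring gives large errors more weight.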

In [None]:
# Computing the metrics
print("=" * 50)
print("MODEL QUALITY METRICS")
print("=" * 50)

# Predictions on the training data (this gives an optimistic estimate of quality)
y_pred = model.predict(X)

# Metrics
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y, y_pred)
r2 = r2_score(y, y_pred)

print(f"\nMSE (Mean Squared Error): {mse:.2f}")
print(f"RMSE (Root MSE): {rmse:.2f} points")
print(f"MAE (Mean Absolute Error): {mae:.2f} points")
print(f"R² (coefficient of determination): {r2:.4f}")

print("\n" + "=" * 50)
print("INTERPRETATION:")
print("=" * 50)
print(f"The model explains {r2*100:.1f}% of the variability in the data")
print(f"A typical prediction error (RMSE) is about {rmse:.1f} points")
print(f"The mean absolute error (MAE) is {mae:.1f} points")

In [None]:
# Visualising the errors (residuals)
residua = y - y_pred

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: actual vs predicted
axes[0].scatter(y, y_pred, c='coral', s=80, alpha=0.7, edgecolors='black')
axes[0].plot([y.min(), y.max()], [y.min(), y.max()], 'g--', linewidth=2, label='Ideal line')
axes[0].set_xlabel('Actual score')
axes[0].set_ylabel('Predicted score')
axes[0].set_title('Actual vs predicted values')
axes[0].legend()

# Plot 2: histogram of residuals
axes[1].hist(residua, bins=10, color='skyblue', edgecolor='black', alpha=0.7)
axes[1].axvline(x=0, color='red', linestyle='--', linewidth=2)
axes[1].set_xlabel('Residual (error)')
axes[1].set_ylabel('Count')
axes[1].set_title('Distribution of errors (residuals)')

# Plot 3: residuals vs predictions
axes[2].scatter(y_pred, residua, c='coral', s=80, alpha=0.7, edgecolors='black')
axes[2].axhline(y=0, color='green', linestyle='--', linewidth=2)
axes[2].set_xlabel('Predicted score')
axes[2].set_ylabel('Residual')
axes[2].set_title('Residuals vs predictions')

plt.suptitle('Analysis of model errors', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nIdeally, the residuals are scattered randomly around zero with no visible pattern.")

## 5. Train/Test Split for regression

Just as with classification, we have to test the model on data it has never seen!
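
The idea behind a holdout split can be sketched by hand: shuffle the row indices once, then reserve a fixed share of them for testing. This is only an illustration of the principle, not a description of how `train_test_split` is implemented; the function name `holdout_split` and its rounding rule are assumptions made for this sketch.

```python
import random


def holdout_split(n_samples, test_size=0.25, seed=42):
    """Return (train_indices, test_indices) for a shuffled holdout split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # a seeded shuffle is reproducible
    n_test = max(1, round(n_samples * test_size))  # assumed rounding rule
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = holdout_split(20, test_size=0.25)
print(f"Train: {len(train_idx)} samples, Test: {len(test_idx)} samples")
print(f"Overlap between train and test: {set(train_idx) & set(test_idx)}")
```

The key property is that no index appears in both sets, so the test score reflects data the model never saw during fitting.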

In [None]:
# The proper approach: Train/Test Split
print("=" * 50)
print("TRAIN/TEST SPLIT FOR REGRESSION")
print("=" * 50)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print("\nData split:")
print(f"  Train: {len(X_train)} samples")
print(f"  Test:  {len(X_test)} samples")

# Train on the training data only
model_proper = LinearRegression()
model_proper.fit(X_train, y_train)

# Predictions
y_train_pred = model_proper.predict(X_train)
y_test_pred = model_proper.predict(X_test)

# Metrics
r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

print("\nResults:")
print(f"  Train R²: {r2_train:.4f}, RMSE: {rmse_train:.2f}")
print(f"  Test R²:  {r2_test:.4f}, RMSE: {rmse_test:.2f}")

In [None]:
# Visualising Train vs Test
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: data and model
axes[0].scatter(X_train, y_train, c='blue', s=80, alpha=0.7, label='Train data')
axes[0].scatter(X_test, y_test, c='red', s=80, alpha=0.7, label='Test data')
# Use scalar min/max of the column and predict on a DataFrame with the training feature name
x_line = np.linspace(X['hodiny_studia'].min(), X['hodiny_studia'].max(), 100).reshape(-1, 1)
y_line_proper = model_proper.predict(pd.DataFrame(x_line, columns=['hodiny_studia']))
axes[0].plot(x_line, y_line_proper, 'g-', linewidth=2, label='Model')
axes[0].set_xlabel('Hours of study')
axes[0].set_ylabel('Exam score')
axes[0].set_title('Train vs Test data')
axes[0].legend()

# Plot 2: comparing R² scores
kategorie = ['Train R²', 'Test R²']
hodnoty = [r2_train, r2_test]
barvy = ['blue', 'red']
bars = axes[1].bar(kategorie, hodnoty, color=barvy, alpha=0.7, edgecolor='black')
axes[1].set_ylabel('R² score')
axes[1].set_title('Train vs Test performance')
axes[1].set_ylim(0, 1.1)  # headroom for the value labels above the bars

for bar, val in zip(bars, hodnoty):
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
                 f'{val:.3f}', ha='center', fontsize=12)

plt.tight_layout()
plt.show()

## 6. Multiple linear regression

What if we have **several input variables**?

For example: predicting the price of a house from:
- Floor area (m²)
- Number of rooms
- Age of the house

Equation: `y = β₁·x₁ + β₂·x₂ + β₃·x₃ + β₀`
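
With several inputs, the least-squares coefficients can be found by solving a linear least-squares problem on a design matrix with an extra column of ones for β₀. The sketch below uses `np.linalg.lstsq` on a small, hypothetical noise-free dataset; the names `X_toy`, `y_toy` and the chosen coefficients are assumptions for illustration only, and the notebook's own model uses `LinearRegression` instead.

```python
import numpy as np

# Hypothetical data: columns are floor area, number of rooms, age
X_toy = np.array([
    [50.0, 1.0, 30.0],
    [80.0, 2.0, 10.0],
    [120.0, 4.0, 5.0],
    [150.0, 3.0, 40.0],
    [200.0, 5.0, 0.0],
])
true_coefs = np.array([0.05, 0.3, -0.02])
true_intercept = 3.0
y_toy = X_toy @ true_coefs + true_intercept  # no noise, so the fit is exact

# Append a column of ones so the last fitted weight plays the role of β₀
X_design = np.column_stack([X_toy, np.ones(len(X_toy))])
weights, *_ = np.linalg.lstsq(X_design, y_toy, rcond=None)
coefs_toy, intercept_toy = weights[:3], weights[3]

print("Coefficients:", np.round(coefs_toy, 4))
print("Intercept:", round(float(intercept_toy), 4))
```

Since the toy targets contain no noise, the recovered weights match the ones used to generate them; with real data they are the values that minimise the sum of squared residuals.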

In [None]:
# Create a synthetic dataset of houses
np.random.seed(42)
n = 100

# plocha_m2 = floor area, pocet_pokoju = number of rooms, stari_roku = age in years
domy = pd.DataFrame({
    'plocha_m2': np.random.uniform(50, 200, n),
    'pocet_pokoju': np.random.randint(1, 6, n),   # 1 to 5 rooms
    'stari_roku': np.random.uniform(0, 50, n)
})

# The price (cena_mil, in millions of CZK) depends on all factors
domy['cena_mil'] = (
    0.05 * domy['plocha_m2'] +          # Larger area = higher price
    0.3 * domy['pocet_pokoju'] +        # More rooms = higher price
    -0.02 * domy['stari_roku'] +        # Older house = lower price
    np.random.normal(0, 0.5, n) +       # Random noise
    3                                    # Base price
)

print("=" * 50)
print("HOUSE DATASET")
print("=" * 50)
print(domy.head(10))
print("\nStatistics:")
print(domy.describe().round(2))

In [None]:
# Correlation matrix
plt.figure(figsize=(8, 6))
korelace = domy.corr()
sns.heatmap(korelace, annot=True, cmap='coolwarm', center=0,
            fmt='.2f', square=True, linewidths=0.5)
plt.title('Correlation matrix - relationships between variables')
plt.tight_layout()
plt.show()

print("\nCorrelation with price:")
print(korelace['cena_mil'].sort_values(ascending=False))

In [None]:
# Multiple linear regression
print("=" * 50)
print("MULTIPLE LINEAR REGRESSION")
print("=" * 50)

# Prepare the data
X_domy = domy[['plocha_m2', 'pocet_pokoju', 'stari_roku']]
y_domy = domy['cena_mil']

# Train/Test split
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(
    X_domy, y_domy, test_size=0.2, random_state=42
)

# Train the model
model_domy = LinearRegression()
model_domy.fit(X_train_d, y_train_d)

# Parameters
print("\nModel equation:")
print(f"cena = {model_domy.coef_[0]:.4f}*plocha + {model_domy.coef_[1]:.4f}*pokoje + {model_domy.coef_[2]:.4f}*stari + {model_domy.intercept_:.4f}")

print("\nInterpretation of the coefficients (other variables held fixed):")
for feature, coef in zip(X_domy.columns, model_domy.coef_):
    smer = "raises" if coef > 0 else "lowers"
    print(f"  {feature}: +1 unit {smer} the price by {abs(coef):.4f} million CZK")

# Evaluation
y_pred_d = model_domy.predict(X_test_d)
r2_domy = r2_score(y_test_d, y_pred_d)
rmse_domy = np.sqrt(mean_squared_error(y_test_d, y_pred_d))

print("\nEvaluation on the test data:")
print(f"  R²: {r2_domy:.4f}")
print(f"  RMSE: {rmse_domy:.4f} million CZK")

In [None]:
# Predicting the price of a new house
print("=" * 50)
print("PREDICTING THE PRICE OF A NEW HOUSE")
print("=" * 50)

novy_dum = pd.DataFrame({
    'plocha_m2': [120],
    'pocet_pokoju': [4],
    'stari_roku': [10]
})

print("\nHouse parameters:")
print("  Floor area: 120 m²")
print("  Number of rooms: 4")
print("  Age: 10 years")

predikce_cena = model_domy.predict(novy_dum)[0]
print(f"\nPredicted price: {predikce_cena:.2f} million CZK")

## 7. Summary and review questions

### What we have learned:

| Concept | Description |
|---------|-------------|
| **Linear regression** | Looks for the line that best describes the data |
| **Slope (β₁)** | How much Y changes when X increases by 1 |
| **Intercept (β₀)** | The value of Y when X = 0 |
| **R² score** | Share of the variability in Y that the model explains |
| **RMSE** | Typical size of the prediction error, in the units of Y |

### Review questions:

1. **What does the slope of the regression line tell us?**
2. **Why is it important to split the data into train and test sets?**
3. **What does R² = 0.95 mean?**
4. **When should we use multiple regression?**

In [None]:
# Summary visualisation
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: simple regression
axes[0].scatter(X, y, c='coral', s=80, alpha=0.7, edgecolors='black')
axes[0].plot(x_line, model.predict(pd.DataFrame(x_line, columns=['hodiny_studia'])), 'b-', linewidth=2)
axes[0].set_xlabel('X')
axes[0].set_ylabel('Y')
axes[0].set_title(f'Simple regression\ny = {smernice:.2f}x + {prusecik:.2f}')

# Plot 2: importance of the individual variables
# Raw coefficients depend on each feature's units, so we scale them by the
# feature's standard deviation to make them comparable
importance = pd.DataFrame({
    'Feature': X_domy.columns,
    'Koeficient': np.abs(model_domy.coef_) * X_train_d.std().values
}).sort_values('Koeficient', ascending=True)

axes[1].barh(importance['Feature'], importance['Koeficient'], color='skyblue', edgecolor='black')
axes[1].set_xlabel('|coefficient| × feature standard deviation')
axes[1].set_title('Importance of variables (multiple regression)')

plt.suptitle('Linear regression - summary', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Test your knowledge
print("=" * 60)
print("QUIZ: TEST YOUR KNOWLEDGE")
print("=" * 60)

otazky = [
    {
        "otazka": "What does a slope (β₁) of 10 mean in the model y = 10x + 5?",
        "moznosti": ["A) Y will always be 10",
                     "B) When X increases by 1, Y increases by 10",
                     "C) The model is 10% accurate"],
        "spravne": "B"
    },
    {
        "otazka": "What does R² = 0.85 mean?",
        "moznosti": ["A) The model has an 85% error",
                     "B) The model explains 85% of the variability in the data",
                     "C) The model has 85 parameters"],
        "spravne": "B"
    },
    {
        "otazka": "When do we use multiple regression?",
        "moznosti": ["A) When we have several outputs",
                     "B) When we have several input variables",
                     "C) When we want a faster computation"],
        "spravne": "B"
    }
]

for i, q in enumerate(otazky, 1):
    print(f"\nQuestion {i}: {q['otazka']}")
    for m in q['moznosti']:
        print(f"  {m}")
    print(f"  --> Correct answer: {q['spravne']}")

## 8. Your challenge

1. **Add another student** to the dataset and watch how the model changes
2. **Try a different dataset** - e.g. predicting temperature from humidity
3. **Compare models** - simple vs multiple regression on the same data

In [None]:
# Your space for experimenting
# -----------------------------

# Tip: try adding a student
# novy_student = pd.DataFrame({'hodiny_studia': [15], 'znamka': [100]})
# studenti_rozsireni = pd.concat([studenti, novy_student], ignore_index=True)

# Tip: load a dataset from the internet
# url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv'
# tips = pd.read_csv(url)

print("Experiment with the code above!")