{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5334a33f",
   "metadata": {},
   "source": [
    "# Objectives"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b94eae2b",
   "metadata": {},
   "source": [
    "The objectives of this lesson are to:\n",
    "* Compute the correlation between two variables\n",
    "* Fit a linear regression with `statsmodels`\n",
    "* Interpret the quality of the linear regression"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "011c498f",
   "metadata": {},
   "source": [
    "# Import the libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aec469bd",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import os\n",
    "\n",
    "# Visualisation\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "# Machine learning\n",
    "import statsmodels.api as sm"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8aaa7134",
   "metadata": {},
   "source": [
    "# Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "91f5951f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the data from the Data folder of the current working directory\n",
    "filepath = os.path.join(os.getcwd(), \"Data\", \"salary_data.csv\")\n",
    "data = pd.read_csv(\n",
    "    filepath,\n",
    "    delimiter=\",\"\n",
    ")\n",
    "\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "158333fc",
   "metadata": {},
   "source": [
    "# (Quick) exploratory data analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff5e2126",
   "metadata": {},
   "source": [
    "## Quick overview"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5b3d0917",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Summary statistics\n",
    "data.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4f4664d0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Are any values missing (NaN)? Compare the non-null counts with the number of rows\n",
    "data.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa7cd38a",
   "metadata": {},
   "source": [
    "## Bivariate analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "81b6d1fe",
   "metadata": {},
   "source": [
    "[`sns.pairplot()`](https://seaborn.pydata.org/generated/seaborn.pairplot.html) plots the pairwise relationships between the variables in our dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d29daace",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Examine the relationship between the independent variable and the dependent variable\n",
    "sns.pairplot(\n",
    "    data=data,\n",
    "    corner=True  # plot only the lower triangle to avoid duplicated scatter plots\n",
    ")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c63fce3",
   "metadata": {},
   "source": [
    "The two quantitative variables appear to be **linearly related**, and the trend is increasing. We therefore expect a **positive correlation coefficient close to 1**."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af867d38",
   "metadata": {},
   "source": [
    "## Correlation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "34102c3c",
   "metadata": {},
   "source": [
    "The [`.corr()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) method computes the pairwise correlation between **quantitative** variables."
   ]
  },
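  {
   "cell_type": "markdown",
   "id": "7c1e5a90",
   "metadata": {},
   "source": [
    "By default, `.corr()` uses the Pearson coefficient. As a reminder, and using notation that is introduced here only for illustration ($x_i$ and $y_i$ for the $n$ observed pairs, $\\bar{x}$ and $\\bar{y}$ for their means), it can be written as:\n",
    "\n",
    "$$ r = \\frac{\\sum_{i=1}^{n} (x_i - \\bar{x})(y_i - \\bar{y})}{\\sqrt{\\sum_{i=1}^{n} (x_i - \\bar{x})^2} \\, \\sqrt{\\sum_{i=1}^{n} (y_i - \\bar{y})^2}} $$\n",
    "\n",
    "The value of $r$ always lies between -1 and 1: values near 1 indicate a strong increasing linear relationship, values near -1 a strong decreasing one, and values near 0 little or no linear relationship."
   ]
  },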
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4b9b21d8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute the correlation coefficient between the two variables\n",
    "data.corr()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe36ff19",
   "metadata": {},
   "source": [
    "The correlation coefficient lies between 0.6 and 1.0, so the **correlation is strong**."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28737428",
   "metadata": {},
   "source": [
    "# Linear regression"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e899d7f1",
   "metadata": {},
   "source": [
    "Our initial assumptions are confirmed: the two variables are linearly related and strongly correlated, so we would like to find the line of best fit through the data. To do so, we will use [`statsmodels.OLS`](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html) (Ordinary Least Squares), which is the classic estimation method."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a52e34a8",
   "metadata": {},
   "source": [
    "## Fit a linear regression"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c9008ca",
   "metadata": {},
   "source": [
    "First, we need to split the variables into two separate objects: an explanatory variable `x` and a response variable `y`. Each one is a pandas `Series` selected from the dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "99890031",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split the variables\n",
    "x = data[\"YearsExperience\"]\n",
    "y = data[\"Salary\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25a2961c",
   "metadata": {},
   "source": [
    "Here is the model we are going to fit:\n",
    "\n",
    "$$ \\underbrace{\\text{Salary}}_{y} = \\underbrace{a}_{\\text{slope}} \\cdot \\underbrace{\\text{YearsExperience}}_{x} + \\underbrace{b}_{\\text{intercept}} $$"
   ]
  },
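  {
   "cell_type": "markdown",
   "id": "d3f2b6a1",
   "metadata": {},
   "source": [
    "For a model with a single explanatory variable, the least-squares estimates have a well-known closed form. As a sketch, writing $x_i$ and $y_i$ for the $n$ observations and $\\bar{x}$, $\\bar{y}$ for their means (notation assumed here, not defined elsewhere in the notebook), OLS chooses $a$ and $b$ to minimize the sum of squared residuals:\n",
    "\n",
    "$$ \\min_{a,\\, b} \\sum_{i=1}^{n} (y_i - a x_i - b)^2 $$\n",
    "\n",
    "which gives:\n",
    "\n",
    "$$ a = \\frac{\\sum_{i=1}^{n} (x_i - \\bar{x})(y_i - \\bar{y})}{\\sum_{i=1}^{n} (x_i - \\bar{x})^2}, \\qquad b = \\bar{y} - a \\bar{x} $$\n",
    "\n",
    "The second equation means that the fitted line always passes through the point of means $(\\bar{x}, \\bar{y})$."
   ]
  },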
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "49b09ae6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Build a linear regression model\n",
    "model = sm.OLS(\n",
    "    y,\n",
    "    sm.add_constant(x) # add the intercept term, which OLS does not include by default\n",
    ")\n",
    "\n",
    "# Fit the model\n",
    "model_fit = model.fit()\n",
    "\n",
    "# Print the summary\n",
    "print(model_fit.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3935f71c",
   "metadata": {},
   "source": [
    "The summary above presents a lot of information. Note in particular that the **R-squared is 95.7%**, which is very high. This was expected, since the correlation coefficient was above 0.9.\n",
    "\n",
    "We also see that the **slope is positive and large** (9449), which confirms that salary tends to increase with years of experience: each additional year is associated with an increase of about 9449 in predicted salary."
   ]
  },
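  {
   "cell_type": "markdown",
   "id": "e8a4c7f2",
   "metadata": {},
   "source": [
    "The link between R-squared and the correlation coefficient can be made explicit. As a sketch, with $\\hat{y}_i$ denoting the fitted values, $\\bar{y}$ the mean of `Salary`, and $r$ the Pearson correlation between `YearsExperience` and `Salary` (symbols assumed here for illustration):\n",
    "\n",
    "$$ R^2 = 1 - \\frac{\\sum_{i=1}^{n} (y_i - \\hat{y}_i)^2}{\\sum_{i=1}^{n} (y_i - \\bar{y})^2} $$\n",
    "\n",
    "In a simple linear regression with an intercept, this quantity equals $r^2$. Starting from the R-squared of 0.957 reported above and a positive slope, this gives:\n",
    "\n",
    "$$ r = \\sqrt{0.957} \\approx 0.978 $$"
   ]
  },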
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d4d1ad0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot the data together with the fitted regression line\n",
    "sns.regplot(\n",
    "    data=data,\n",
    "    x=\"YearsExperience\",\n",
    "    y=\"Salary\",\n",
    "    ci=None, # do not draw the confidence interval\n",
    "    line_kws={\"color\": \"red\"} # draw the regression line in red\n",
    ")\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}