# Lab - Linear regression and the gradient descent algorithm

This lab uses data collected in the early 1970s by the Boston (USA) city services on housing in various neighborhoods:
* ...
* 6- AGE: proportion of owner-occupied units built before 1940
* 7- DIS: (weighted) distance to 5 employment centers
* ...
* 12- LSTAT: % of the population with lower socio-economic status
* 13- MEDV: median value of owner-occupied homes (in $1,000s)


The **objective** is to study the relationship between the value of housing in a neighborhood and the age of the housing, the distance to employment centers, and the socio-economic level of the neighborhood.

Note: the data were originally published by Harrison, D. and Rubinfeld, D.L., _Hedonic prices and the demand for clean air_, J. Environ. Economics & Management, vol. 5, 81-102, 1978.
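
As a minimal sketch of the technique named in the lab title, the objective above can be read as fitting a linear model MEDV ≈ θ0 + θ1·AGE + θ2·DIS + θ3·LSTAT by batch gradient descent on the mean squared error. Everything below is an assumption made for illustration: the function name `gradient_descent`, its parameters, the synthetic data standing in for three standardized predictors, and the coefficient values in `true_theta`. It is not the lab's reference solution.

```python
import numpy as np


def gradient_descent(X, y, learning_rate=0.1, n_iterations=2000):
    """Fit y ≈ theta[0] + X @ theta[1:] by batch gradient descent on the MSE."""
    n_samples = X.shape[0]
    # Prepend a column of ones so that theta[0] plays the role of the intercept.
    X_b = np.column_stack([np.ones(n_samples), X])
    theta = np.zeros(X_b.shape[1])
    for _ in range(n_iterations):
        residuals = X_b @ theta - y
        # Gradient of (1/n) * sum(residuals**2) with respect to theta.
        gradient = (2.0 / n_samples) * (X_b.T @ residuals)
        theta -= learning_rate * gradient
    return theta


# Synthetic stand-in for three standardized predictors (hypothetical values,
# not the Boston data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([22.0, -1.0, 0.5, -4.0])  # hypothetical coefficients
y = true_theta[0] + X @ true_theta[1:] + rng.normal(scale=0.1, size=200)

theta_gd = gradient_descent(X, y)

# Closed-form least-squares solution, used only as a sanity check.
X_b = np.column_stack([np.ones(len(X)), X])
theta_ols, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print("gradient descent:", theta_gd)
print("least squares:   ", theta_ols)
```

On real data the predictors would typically need to be standardized first, otherwise a single learning rate may diverge or converge very slowly.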

In [7]:
from sklearn import datasets # datasets
import os # working directory
import pandas as pd # data analysis
from scipy import stats # descriptive statistics
import matplotlib.pyplot as plt # plots
import numpy as np # maths
import seaborn as sns

PATH = ""

In [12]:
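students = {}  # dictionary that will hold the dataset description (key "DESCR")

# Load the spreadsheet located in the PATH directory into a DataFrame
donnees = pd.read_excel(PATH + "data_academic_performance.xlsx")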

print(donnees.dtypes)

"""
Keep the attribute names at hand so we can choose which ones to drop
"""
numerical_attribute = ("MAT_S11",
                       "CR_S11",
                       "CC_S11",
                       "BIO_S11",
                       "ENG_S11",
                       "QR_PRO",
                       "CR_PRO",
                       "CC_PRO",
                       "ENG_PRO",
                       "WC_PRO",
                       "FEP_PRO",
                       "G_SC",
                       "PERCENTILE",
                       "2ND_DECILE",
                       "QUARTILE",
                       "SEL",
                       "SEL_IHE")
categorical_attributes = ("GENDER",
                          "EDU_FATHER",
                          "EDU_MOTHER",
                          "OCC_FATHER",
                          "OCC_MOTHER",
                          "STRATUM",
                          "SISBEN",
                          "PEOPLE_HOUSE",
                          "INTERNET",
                          "TV",
                          "COMPUTER",
                          "WASHING_MCH",
                          "MIC_OVEN",
                          "CAR",
                          "DVD",
                          "FRESH",
                          "PHONE",
                          "MOBILE",
                          "REVENUE",
                          "JOB",
                          "SCHOOL_NAT",
                          "SCHOOL_TYPE",
                          "ACADEMIC_PROGRAM",
                          #"COD_S11",
                          #"Cod_SPro",
                          #"UNIVERSITY",
                          #"SCHOOL_NAME"
                         )

print(donnees.head())

print(donnees.describe())

students["DESCR"] = """.. _data_academic_performance:

Data of Academic Performance evolution for Engineering Students
---------------------------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 12411

    :Attribute Information: academic, social, and economic attributes
        (see the column list printed by ``donnees.dtypes``)

This data article presents data on the results in national assessments for secondary and university education in engineering students. 
The data contains academic, social, economic information for 12,411 students. 
The data were obtained by orderly crossing the databases of the Colombian Institute for the Evaluation of Education (ICFES). 
The data are part of the Master's Degree in Engineering project of the Technological University of Bolívar (UTB) 
titled Academic Efficiency Analysis in Engineering students. Developed by Professor Enrique De La Hoz and engineer Rohemi Zuluaga.

.. topic:: References

    - De La Hoz, Enrique (2020), “Data of Academic Performance evolution for Engineering Students”, Mendeley Data, V1, doi: 10.17632/83tcx8psxv.1"""

COD_S11              object
GENDER               object
EDU_FATHER           object
EDU_MOTHER           object
OCC_FATHER           object
OCC_MOTHER           object
STRATUM              object
SISBEN               object
PEOPLE_HOUSE         object
Unnamed: 9          float64
INTERNET             object
TV                   object
COMPUTER             object
WASHING_MCH          object
MIC_OVEN             object
CAR                  object
DVD                  object
FRESH                object
PHONE                object
MOBILE               object
REVENUE              object
JOB                  object
SCHOOL_NAME          object
SCHOOL_NAT           object
SCHOOL_TYPE          object
MAT_S11               int64
CR_S11                int64
CC_S11                int64
BIO_S11               int64
ENG_S11               int64
Cod_SPro             object
UNIVERSITY           object
ACADEMIC_PROGRAM     object
QR_PRO                int64
CR_PRO                int64
CC_PRO              