# Project 2 - C Dados

_This project is currently maintained by Enricco Gemha, Paola BBS Friedel, Rafael C Leventhal, and Marcelo R Barranco._

## Note
To refer to certain recurring terms, we use the abbreviations listed below:
- df: DataFrame
- pd: the Python library pandas

## Project Kickstart

In [404]:
# importing the libraries that will be used and defining settings
import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt

In [405]:
# loading the dataset into a pd df
students = pd.read_csv('students.csv')

In [406]:
# checking the df
students

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
390,MS,M,20,U,LE3,A,2,2,services,services,...,5,5,4,4,5,4,11,9,9,9
391,MS,M,17,U,LE3,T,3,1,services,services,...,2,4,5,3,4,2,3,14,16,16
392,MS,M,21,R,GT3,T,1,1,other,other,...,5,5,3,3,3,3,3,10,8,7
393,MS,M,18,R,LE3,T,3,2,services,other,...,4,4,1,3,4,5,0,11,12,10
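
The `Unnamed: 0` column in the output above is what pandas typically produces when a CSV file was saved together with its index. A minimal sketch, assuming that is the case for `students.csv` (the inline CSV text below is made-up illustration data, not the project's file), of how `index_col=0` avoids carrying that column along as a feature:

```python
import io

import pandas as pd

# hypothetical CSV content whose first column is an unnamed saved index
csv_text = ",school,age\n0,GP,18\n1,MS,17\n"

# default read: the unnamed first column becomes a regular column
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the row index instead
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```

Whether this is appropriate here depends on how `students.csv` was written, so it is worth checking before changing the loading cell.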


## Cleaning the df

**Next step**: Remove the features (columns) whose information will not be used in this project.

In [407]:
# NOTE: the drop below is commented out, so the df displayed next is unchanged
# students.drop(columns=['G2', 'G1', 'school'], inplace=True)
students

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
390,MS,M,20,U,LE3,A,2,2,services,services,...,5,5,4,4,5,4,11,9,9,9
391,MS,M,17,U,LE3,T,3,1,services,services,...,2,4,5,3,4,2,3,14,16,16
392,MS,M,21,R,GT3,T,1,1,other,other,...,5,5,3,3,3,3,3,10,8,7
393,MS,M,18,R,LE3,T,3,2,services,other,...,4,4,1,3,4,5,0,11,12,10
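
The column-removal step can be sketched on a small made-up frame. The toy data below is an assumption for illustration; only the column names `school`, `G1`, `G2`, and `G3` come from the cell above. `DataFrame.drop` returns a new df by default, which leaves the original untouched:

```python
import pandas as pd

# toy frame that reuses a few of the project's column names
toy = pd.DataFrame({
    'school': ['GP', 'MS'],
    'age': [18, 17],
    'G1': [5, 14],
    'G2': [6, 16],
    'G3': [6, 16],
})

# drop returns a new df; `toy` itself keeps all of its columns
cleaned = toy.drop(columns=['G2', 'G1', 'school'])

print(list(cleaned.columns))
```

Assigning the result to a new name, rather than using `inplace=True`, makes it easy to compare the df before and after the step.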


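The next step is to convert the qualitative variables into quantitative ones, replacing each one with indicator (dummy) columns and dropping the first category of each variable.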

In [408]:
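train = students.copy()
for column in train.columns:
    # dtype kind 'O' marks object (string) columns, i.e. the qualitative variables
    if train[column].dtype.kind == 'O':
        # one indicator column per category, dropping the first to avoid redundancy
        dummies = pd.get_dummies(train[column]).iloc[:, 1:]
        dummies.columns = [f'{column} {col}' for col in dummies.columns]
        # append the indicator columns and remove the original qualitative column
        train = pd.concat([train, dummies], axis=1)
        train.drop(columns=column, inplace=True)

train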

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,...,guardian mother,guardian other,schoolsup yes,famsup yes,paid yes,activities yes,nursery yes,higher yes,internet yes,romantic yes
0,18,4,4,2,2,0,4,3,4,1,...,1,0,1,0,0,0,1,1,0,0
1,17,1,1,1,2,0,5,3,3,1,...,0,0,0,1,0,0,0,1,1,0
2,15,1,1,1,2,3,4,3,2,2,...,1,0,1,0,1,0,1,1,1,0
3,15,4,2,1,3,0,3,2,2,1,...,1,0,0,1,1,1,1,1,1,1
4,16,3,3,1,2,0,4,3,2,1,...,0,0,0,1,1,0,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
390,20,2,2,1,2,2,5,5,4,4,...,0,1,0,1,1,0,1,1,0,0
391,17,3,1,2,1,0,2,4,5,3,...,1,0,0,0,0,0,0,1,1,0
392,21,1,1,1,1,3,5,5,3,3,...,0,1,0,0,0,0,0,1,0,0
393,18,3,2,3,1,0,4,4,1,3,...,1,0,0,0,0,0,0,1,1,0
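
The loop above can likely be expressed as a single `pd.get_dummies` call. A minimal sketch on made-up data (the toy frame is an assumption; `prefix_sep=' '` is chosen only to reproduce the `column value` naming used above, and `dtype=int` to get 0/1 values):

```python
import pandas as pd

# toy frame with one numeric and two qualitative columns
toy = pd.DataFrame({
    'age': [18, 17, 15],
    'sex': ['F', 'M', 'F'],
    'Mjob': ['at_home', 'teacher', 'other'],
})

# encode every qualitative column at once, dropping the first category of each
encoded = pd.get_dummies(toy, prefix_sep=' ', drop_first=True, dtype=int)

print(encoded)
```

Numeric columns such as `age` pass through unchanged, and the indicator columns are appended after them.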


## Exploratory Analysis

**Next step**: To find features that are strongly correlated with each other, that is, features that carry overlapping information, we use `.corr()` to obtain the pairwise correlations between the features of the df. Recall that a correlation is considered strong here if `|corr| >= 0.7`.

In [424]:
# separating the target and the features into 2 different variables
# (kept as pandas objects so that .corr() and .corrwith() are available below)
target = train.G3
features = train.drop(columns='G3')

In [410]:
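features_corr = features.corr()

# print every correlation at or above 0.7, skipping the diagonal (exactly 1.0)
# note: only positive values are tested, so a strong negative correlation would not be printed
for column in list(features_corr):
    for correlation in features_corr[column]:
        if correlation >= 0.7 and correlation != 1.0:
            print(correlation)
            print('\n')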



0.852118066364806


0.852118066364806




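After analyzing the `features_corr` df, we find a single value at or above the threshold, approximately 0.852, which appears twice because the correlation matrix is symmetric; the code above does not print which pair of features it belongs to. All other positive correlations between features are weak or moderate, so those features cannot be merged on the grounds of similarity. We therefore treat each feature as having its own influence on the `target` variable, with that one pair as a candidate for closer inspection.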
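
The strong-correlation check can also be written so that it reports which features form each pair and applies the `|corr| >= 0.7` rule to negative values too. A minimal sketch, assuming a hypothetical helper `strong_pairs` and made-up toy data (neither is part of the project):

```python
from itertools import combinations

import pandas as pd


def strong_pairs(df, threshold=0.7):
    """Return (feature_a, feature_b, corr) for each pair with |corr| >= threshold."""
    corr = df.corr()
    # combinations() visits each unordered pair once, so nothing is reported twice
    return [
        (a, b, corr.loc[a, b])
        for a, b in combinations(corr.columns, 2)
        if abs(corr.loc[a, b]) >= threshold
    ]


# toy data: b rises with a, c falls with a, d is unrelated to the others
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10.5],
    'c': [5, 4, 3, 2, 1],
    'd': [1, -1, 1, -1, 1],
})

for a, b, value in strong_pairs(toy):
    print(f'{a} ~ {b}: {value:.3f}')
```

Iterating over unordered pairs avoids the duplicated printout that comes from scanning the full symmetric matrix.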

## Relating the target to the features

**Next step**: now that we have tested the correlation between the features, we move on to correlating the `target` variable with each feature individually.

In [411]:
target_corr = features.corrwith(target, method='pearson')
target_corr.sort_values(ascending=True, inplace=True)
target_corr

failures            -0.360415
age                 -0.161579
goout               -0.132791
romantic yes        -0.129970
traveltime          -0.117142
Mjob other          -0.096477
guardian other      -0.087774
schoolsup yes       -0.082788
health              -0.061335
Pstatus T           -0.058009
Dalc                -0.054660
Fjob other          -0.053483
Walc                -0.051939
school MS           -0.045017
famsup yes          -0.039157
reason home         -0.021359
Fjob services       -0.016108
freetime             0.011307
activities yes       0.016100
guardian mother      0.022338
absences             0.034247
famrel               0.051363
nursery yes          0.051568
reason other         0.052008
Fjob health          0.057111
Mjob teacher         0.057712
Mjob services        0.078429
famsize LE3          0.081407
Fjob teacher         0.095374
reason reputation    0.095692
studytime            0.097820
internet yes         0.098483
paid yes             0.101996
sex M     
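
Because both strongly negative and strongly positive values matter, it can help to rank the features by the absolute value of their correlation with the target. A minimal sketch on made-up data (the toy values are assumptions; only the names `failures`, `studytime`, and `G3` come from the project):

```python
import pandas as pd

# toy frame with two features and the target G3
toy = pd.DataFrame({
    'failures': [0, 3, 0, 1, 2],
    'studytime': [2, 1, 3, 2, 1],
    'G3': [15, 6, 17, 11, 8],
})
target = toy['G3']
features = toy.drop(columns='G3')

# Pearson correlation of each feature column with the target
target_corr = features.corrwith(target, method='pearson')

# rank by strength of association, regardless of sign
ranked = target_corr.sort_values(key=abs, ascending=False)

print(ranked)
```

Sorting with `key=abs` keeps the sign visible in the output while ordering by magnitude.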

In [456]:
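from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# hold out 30% of the rows for testing (no random_state is set, so the split changes on every run)
X_train, X_test, Y_train, Y_test = train_test_split(features, target, test_size=0.3)

# fit an ordinary least squares linear regression on the training split
model = LinearRegression()
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

# mean squared error on the held-out test split
mean_squared_error(Y_test, Y_pred)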

4.01063200670858
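
An MSE is easier to interpret next to its square root (RMSE, which is in the same units as the target) and a naive baseline that always predicts the training mean. A minimal sketch on synthetic data (the data, seed, and variable names below are assumptions for illustration, not the project's dataset), which also fixes `random_state` so the split is reproducible:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic regression problem: y depends linearly on two features plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# fixed random_state makes the 70/30 split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# model error on the held-out split, plus its square root
mse = mean_squared_error(y_test, model.predict(X_test))
rmse = np.sqrt(mse)

# baseline: always predict the mean of the training target
baseline_mse = mean_squared_error(y_test, np.full_like(y_test, y_train.mean()))

print(f'model MSE: {mse:.3f}  RMSE: {rmse:.3f}  baseline MSE: {baseline_mse:.3f}')
```

A model is only informative if its error is clearly below the baseline; the same comparison could be applied to the project's `features` and `target`.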

## References

- To do