# Exercises - Entering the cycle

The time has come to solve some problems using the data science techniques we have accumulated so far!

The idea is for you to practice (ideally in groups) the **pipeline of a data science project**.

Work through the complete pipeline (including the data exploration steps!), but give special focus to the modeling step, aiming to improve the **evaluation metrics** that you judge most appropriate!

<img src="https://www.abgconsultoria.com.br/blog/wp-content/uploads/img33-768x242.png" width=700>

___

For each of the following datasets (some of which we already know), answer:

- 1 - what is the problem to be solved?
- 2 - what is the response variable (target)?
- 3 - is the problem a classification or a regression problem?
- 4 - perform EDA on the data!! Get to know the data!
- 5 - build a model that achieves the best evaluation metric (discuss which metric makes the most sense)
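
Questions 2 and 3 can often be answered by looking at the candidate target column itself. The sketch below is only a rule of thumb, not a definitive test: the helper name `describe_target`, the threshold `max_classes`, and the small example columns are all hypothetical and made up for illustration.

```python
import pandas as pd


def describe_target(y: pd.Series, max_classes: int = 10) -> str:
    """Heuristic guess: non-numeric or few distinct values -> classification."""
    n_unique = y.nunique(dropna=True)
    if not pd.api.types.is_numeric_dtype(y) or n_unique <= max_classes:
        return "classification"
    return "regression"


# Tiny made-up columns, only to exercise the helper.
survived = pd.Series([1, 0, 1, 1, 0])
tip = pd.Series([1.01, 1.66, 3.50, 3.31, 3.61, 4.71,
                 2.00, 3.12, 1.96, 3.23, 1.71, 5.00])
species = pd.Series(["setosa", "versicolor", "virginica"])

for name, column in [("survived", survived), ("tip", tip), ("species", species)]:
    print(name, "->", describe_target(column))
```

A numeric column with few distinct values can still be a count or an ordinal score, so the final call should come from what the problem is asking, not from the heuristic alone.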

Notes:

> use the estimators/hypotheses we have learned so far;

> if any member of the group knows other estimators/hypotheses, the group may use these tools **as long as the member who knows them shares the essence of the estimator with the other colleagues**
_____

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

____
____
____

### Problem 1: Titanic

Dataset `titanic.csv` in the `/datasets` folder

In [17]:
titanic = pd.read_csv("../datasets/titanic.csv")
titanic.head(1)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO"


In [18]:
# What is the target? 
titanic.survived.unique()

array([1, 0], dtype=int64)

In [30]:
# Identify the missing values; in this dataset they are encoded as "?"
titanic.replace("?", np.nan, inplace = True)
titanic.isna().sum()

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64
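
Raw counts are easier to judge as a share of the rows. The sketch below simply re-enters the counts printed above and divides them by the 1309 rows of the dataset; on the real frame, `titanic.isna().mean()` should give the same fractions directly.

```python
import pandas as pd

# Missing-value counts copied from the isna().sum() output; 1309 is the row count.
n_rows = 1309
missing = pd.Series(
    {"age": 263, "fare": 1, "cabin": 1014, "embarked": 2,
     "boat": 823, "body": 1188, "home.dest": 564}
)

# Percentage of rows with a missing value, largest first.
missing_pct = (missing / n_rows * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

Columns such as `body` and `cabin` are missing in most rows, which is one argument for dropping them instead of imputing.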

In [28]:
# How many rows and how many columns?
titanic.shape

(1309, 14)

In [31]:
titanic.info()
# Data cleaning plan
# Drop name, ticket, cabin, boat, body and home.dest
# sex, age, fare and embarked are objects: convert them to numbers (sibsp and parch are already int64)
# Create a feature family_size = sibsp + parch
# Drop sibsp and parch
# Drop NaNs

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   pclass     1309 non-null   int64 
 1   survived   1309 non-null   int64 
 2   name       1309 non-null   object
 3   sex        1309 non-null   object
 4   age        1046 non-null   object
 5   sibsp      1309 non-null   int64 
 6   parch      1309 non-null   int64 
 7   ticket     1309 non-null   object
 8   fare       1308 non-null   object
 9   cabin      295 non-null    object
 10  embarked   1307 non-null   object
 11  boat       486 non-null    object
 12  body       121 non-null    object
 13  home.dest  745 non-null    object
dtypes: int64(4), object(10)
memory usage: 143.3+ KB


In [43]:
titanic.drop(["name", "ticket", "cabin", "boat", "body", "home.dest"], axis = 1, inplace = True)

In [47]:
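# Encode sex as 0/1, restricted to its own column
titanic["sex"] = titanic["sex"].map({"female": 0, "male": 1})
# age and fare were read as strings because of the "?" markers
titanic["age"] = pd.to_numeric(titanic["age"])
titanic["fare"] = pd.to_numeric(titanic["fare"])
# Encode the port of embarkation; the numeric order is arbitrary and missing values stay NaN
titanic["embarked"] = titanic["embarked"].map({"S": 1, "C": 2, "Q": 3})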

In [65]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pclass    1309 non-null   int64  
 1   survived  1309 non-null   int64  
 2   sex       1309 non-null   int64  
 3   age       1046 non-null   float64
 4   sibsp     1309 non-null   int64  
 5   parch     1309 non-null   int64  
 6   fare      1308 non-null   float64
 7   embarked  1307 non-null   float64
dtypes: float64(3), int64(5)
memory usage: 81.9 KB
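
The cleaning plan also lists creating `family_size`, dropping `sibsp` and `parch`, and dropping NaNs. The sketch below shows one possible way to finish those steps and fit a first model. It runs on a synthetic stand-in frame (`demo`, with made-up values and a made-up survival rule) so that it is self-contained; the choice of `LogisticRegression` and of accuracy as the metric is an assumption, not something the exercise prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same columns as the cleaned `titanic` frame.
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({
    "pclass": rng.integers(1, 4, n),
    "sex": rng.integers(0, 2, n),
    "age": rng.uniform(1, 70, n),
    "sibsp": rng.integers(0, 4, n),
    "parch": rng.integers(0, 3, n),
    "fare": rng.uniform(5, 200, n),
    "embarked": rng.integers(1, 4, n).astype(float),
})
# Made-up rule so the model has something to learn: sex == 0 survives more often.
survival_prob = np.where(demo["sex"] == 0, 0.75, 0.2)
demo["survived"] = (rng.random(n) < survival_prob).astype(int)
# Simulate missing ages in 30 rows.
demo.loc[demo.sample(30, random_state=0).index, "age"] = np.nan

# Remaining steps from the plan: family_size, drop sibsp/parch, drop NaNs.
demo["family_size"] = demo["sibsp"] + demo["parch"]
demo = demo.drop(columns=["sibsp", "parch"]).dropna()

X = demo.drop(columns="survived")
y = demo["survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"rows kept: {len(demo)}, test accuracy: {accuracy:.2f}")
```

On the real data, `dropna()` would discard every row with a missing `age` (263 of 1309), so it is worth comparing this against imputing `age` before settling on a model.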


___
___
___

### Problem 2 - Tips

Dataset `tips.csv` in the `/datasets` folder

In [3]:
tips = pd.read_csv("../datasets/tips.csv")
tips.head(1)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2


In [None]:
# What is the main problem?

In [None]:
# What is the target?

In [None]:
# Is it a regression or classification problem?

In [None]:
# EDA (only on train set)

In [None]:
# Create the model and train it!
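
One possible starting point, treating `tip` as the target of a regression problem. The frame below is synthetic (the column names match `tips.csv`, but the values and the "about 15% of the bill" rule are made up), and the use of `LinearRegression` with mean absolute error is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same columns as tips.csv.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, n).round(2),
    "sex": rng.choice(["Female", "Male"], n),
    "smoker": rng.choice(["No", "Yes"], n),
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], n),
    "time": rng.choice(["Lunch", "Dinner"], n),
    "size": rng.integers(1, 7, n),
})
# Made-up relationship: tip is roughly 15% of the bill plus noise.
demo["tip"] = (0.15 * demo["total_bill"] + rng.normal(0, 0.5, n)).clip(lower=0).round(2)

# One-hot encode the categorical columns; drop_first avoids redundant dummies.
X = pd.get_dummies(demo.drop(columns="tip"), drop_first=True, dtype=float)
y = demo["tip"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))

# Baseline: always predict the mean tip seen in the training split.
baseline = mean_absolute_error(y_test, np.full(len(y_test), y_train.mean()))
print(f"MAE model: {mae:.2f} | MAE mean baseline: {baseline:.2f}")
```

Comparing against the mean baseline gives the metric a reference point: an error is only "good" if it clearly beats the trivial prediction.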

___
___
___

### Problem 3: house prices

Dataset `house_prices.csv` in the `/datasets` folder

In [6]:
price = pd.read_csv("../datasets/house_prices.csv")
price.head(1)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500


In [None]:
# What is the main problem?

In [None]:
# What is the target?

In [None]:
# Is it a regression or classification problem?

In [None]:
# EDA (only on train set)

In [None]:
# Create the model and train it!
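
A minimal sketch of a first baseline, treating `SalePrice` as a regression target. It uses a synthetic frame with a handful of the columns shown in the preview (values and price rule are made up), keeps only numeric predictors, and fills missing values with medians learned on the training split. The estimator and the RMSE metric are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in with a few of the columns from house_prices.csv.
rng = np.random.default_rng(1)
n = 300
demo = pd.DataFrame({
    "Id": np.arange(1, n + 1),
    "MSZoning": rng.choice(["RL", "RM"], n),
    "LotFrontage": rng.uniform(20, 120, n),
    "LotArea": rng.uniform(2000, 20000, n),
})
# Made-up price rule plus noise, only so the example has signal.
demo["SalePrice"] = (
    50_000 + 10 * demo["LotArea"] + 500 * demo["LotFrontage"] + rng.normal(0, 10_000, n)
)
# Simulate missing LotFrontage values in 40 rows.
demo.loc[demo.sample(40, random_state=1).index, "LotFrontage"] = np.nan

# Keep numeric predictors only; Id is an identifier, not a feature.
X = demo.select_dtypes("number").drop(columns=["Id", "SalePrice"])
y = demo["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Learn the medians on the training split and reuse them on the test split.
medians = X_train.median()
X_train, X_test = X_train.fillna(medians), X_test.fillna(medians)

model = LinearRegression().fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"test RMSE: {rmse:,.0f}")
```

Computing the medians on the training split only keeps information from the test rows out of the preprocessing, which is the same reason the EDA cell says "only on train set".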

___
___
___

### Problem 4 - Iris

Dataset `iris.csv` in the `/datasets` folder

In [7]:
iris = pd.read_csv("../datasets/iris.csv")
iris.head(1)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa


In [None]:
# What is the main problem?

In [None]:
# What is the target?

In [None]:
# Is it a regression or classification problem?

In [None]:
# EDA (only on train set)

In [None]:
# Create the model and train it!
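
One possible baseline for this multiclass classification problem. To stay self-contained, the sketch loads the copy of Iris bundled with scikit-learn instead of `iris.csv`; its column names differ from the CSV (for example `sepal length (cm)` rather than `sepal_length`) and its target is already integer-coded. The choice of `KNeighborsClassifier` with 5 neighbors is an assumption, not the estimator the exercise expects.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Bundled copy of the dataset, used here as a stand-in for iris.csv.
iris_demo = load_iris(as_frame=True)
X, y = iris_demo.data, iris_demo.target

# Stratify so each species keeps the same share in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
accuracy = accuracy_score(y_test, knn.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```

The three classes are balanced, so plain accuracy is a reasonable first metric here.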

___
___
___

### Problem 5 - breast cancer

Dataset `breast_cancer.csv` in the `/datasets` folder

In [8]:
df_cancer = pd.read_csv("../datasets/breast_cancer.csv")
df_cancer.head(1)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189


In [None]:
# What is the main problem?

In [None]:
# What is the target?

In [None]:
# Is it a regression or classification problem?

In [None]:
# EDA (only on train set)

In [None]:
# Create the model and train it!
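
A hedged sketch of a first model for this binary classification problem. It uses the breast cancer data bundled with scikit-learn as a stand-in for `breast_cancer.csv`; that copy has no `id` column and encodes the target as integers (0 = malignant, 1 = benign) instead of the `M`/`B` labels in `diagnosis`. The scaled `LogisticRegression` pipeline and the focus on recall for the malignant class are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of the dataset, used here as a stand-in for breast_cancer.csv.
data = load_breast_cancer(as_frame=True)
X = data.data
# scikit-learn encodes malignant as 0; flip it so that 1 means malignant.
y = (data.target == 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale inside the pipeline so the scaler is fitted on the training split only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
print(f"malignant recall: {recall:.2f} | malignant precision: {precision:.2f}")
```

Missing a malignant tumor is usually costlier than a false alarm, which is why recall on the malignant class is a natural metric to discuss alongside accuracy.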

___
___
___

### Problem 6 - YOU CHOOSE!

Go to [Kaggle](https://www.kaggle.com/), or get **supervised** data (with the desired target) from any other source, and do what we did above!

___
___
___