## Additional Learning Resources
Refer to [scikit-learn documentation](https://scikit-learn.org/stable/) and the [Pandas user guide](https://pandas.pydata.org/docs/) for detailed explanations of the functions used in this notebook.
For a quick refresher on splitting data:
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```


# Machine Learning Workflow

### 1. Define Business Goal

The goal should be measurable, e.g. create a model that classifies penguin species with at least 85% accuracy.

### 2. Get Data

How else would you get data?
* web scraping (week 4)
* public APIs (week 6)
* published/public datasets (e.g. government databases, open data initiatives)
* buy data
* hacking
* sensor/camera/satellite data
* collect yourself, e.g. surveys

In [19]:
import pandas as pd

In [20]:
df = pd.read_csv('penguins_simple.csv', sep=';')

In [21]:
df

| | Species | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex |
|---|---|---|---|---|---|---|
| 0 | Adelie | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
| 4 | Adelie | 39.3 | 20.6 | 190.0 | 3650.0 | MALE |
| ... | ... | ... | ... | ... | ... | ... |
| 328 | Gentoo | 47.2 | 13.7 | 214.0 | 4925.0 | FEMALE |
| 329 | Gentoo | 46.8 | 14.3 | 215.0 | 4850.0 | FEMALE |
| 330 | Gentoo | 50.4 | 15.7 | 222.0 | 5750.0 | MALE |
| 331 | Gentoo | 45.2 | 14.8 | 212.0 | 5200.0 | FEMALE |


#### 2.1. Assign X and y

In [7]:
X = df[['Flipper Length (mm)', 'Body Mass (g)']]
y = df['Species']

In [22]:
X.shape, y.shape

((333, 2), (333,))

In [23]:
X

| | Flipper Length (mm) | Body Mass (g) |
|---|---|---|
| 0 | 181.0 | 3750.0 |
| 1 | 186.0 | 3800.0 |
| 2 | 195.0 | 3250.0 |
| 3 | 193.0 | 3450.0 |
| 4 | 190.0 | 3650.0 |
| ... | ... | ... |
| 328 | 214.0 | 4925.0 |
| 329 | 215.0 | 4850.0 |
| 330 | 222.0 | 5750.0 |
| 331 | 212.0 | 5200.0 |


In [24]:
y

0      Adelie
1      Adelie
2      Adelie
3      Adelie
4      Adelie
        ...  
328    Gentoo
329    Gentoo
330    Gentoo
331    Gentoo
332    Gentoo
Name: Species, Length: 333, dtype: object

### 3. Train-Test Split

We split the data into a training set and a test set so that we can check how well the model generalizes to unseen data. \
The split helps us _detect_ overfitting; it doesn't help us _prevent_ it. Regularization (next week!) helps with prevention.
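
To make the "detect, not prevent" point concrete, here is a minimal sketch. It assumes synthetic data from `make_classification` (not the penguin dataset) and an unconstrained decision tree as a model that overfits easily; all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with noisy labels (assumption: not the penguin data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# A tree without a depth limit can memorize the training set, noise included
tree = DecisionTreeClassifier(random_state=0).fit(Xd_train, yd_train)
train_acc = tree.score(Xd_train, yd_train)
test_acc = tree.score(Xd_test, yd_test)

# A large gap between the two scores is the symptom of overfitting
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The split only reveals the gap between the two scores. Closing it requires changing the model, for example by limiting `max_depth`.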

In [25]:
from sklearn.model_selection import train_test_split

In [26]:
# OPTION 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100000)

In [27]:
# OPTION 2
# For EDA you may want to split the original dataframe, and assign X and y later
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
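
For Option 2, assigning X and y after the split could look like the sketch below. The small dataframe is an illustrative stand-in with made-up rows, and the names `df_demo`, `df_tr`, `df_te` and `features` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative stand-in for the penguin dataframe (values are for demonstration only)
df_demo = pd.DataFrame({
    "Flipper Length (mm)": [181.0, 186.0, 195.0, 193.0, 190.0,
                            214.0, 215.0, 222.0, 212.0, 210.0],
    "Body Mass (g)": [3750.0, 3800.0, 3250.0, 3450.0, 3650.0,
                      4925.0, 4850.0, 5750.0, 5200.0, 4700.0],
    "Species": ["Adelie"] * 5 + ["Gentoo"] * 5,
})

# Split the whole dataframe first, so EDA can use df_tr with all columns intact
df_tr, df_te = train_test_split(df_demo, test_size=0.2, random_state=42)

# Assign X and y afterwards, separately for each part
features = ["Flipper Length (mm)", "Body Mass (g)"]
X_tr, y_tr = df_tr[features], df_tr["Species"]
X_te, y_te = df_te[features], df_te["Species"]

print(X_tr.shape, y_tr.shape, X_te.shape, y_te.shape)
```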

### 4. Explore the Data

Be creative!
* Plot all the things!
* Check for missing data
* `df.corr(numeric_only=True)`
* `sns.pairplot()`

Do data exploration on _training data_ only!
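
A minimal sketch of two of these checks, run on a training dataframe only. The dataframe `df_tr_demo` is a hypothetical stand-in with one missing value added for illustration. The seaborn call is left as a comment because it needs a plotting backend.

```python
import numpy as np
import pandas as pd

# Hypothetical training dataframe with one missing flipper length
df_tr_demo = pd.DataFrame({
    "Flipper Length (mm)": [181.0, 186.0, np.nan, 214.0, 222.0],
    "Body Mass (g)": [3750.0, 3800.0, 3250.0, 4925.0, 5750.0],
    "Species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"],
})

# Count missing values per column
missing_per_column = df_tr_demo.isna().sum()
print(missing_per_column)

# Correlations only make sense for numeric columns, so select those first
corr = df_tr_demo.select_dtypes("number").corr()
print(corr)

# Pairwise scatter plots coloured by class (requires seaborn):
# import seaborn as sns
# sns.pairplot(df_tr_demo, hue="Species")
```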

### 5. Feature Engineering

FE is about finding the right representation for your data:
* Which features help me build a better model?
* How should I format them to include them in my model?
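
As one possible illustration of "formatting" a feature, a categorical column such as `Sex` can be one-hot encoded so that a model can use it. This is only a sketch on a hypothetical three-row dataframe, and one-hot encoding is just one of several possible representations.

```python
import pandas as pd

# Hypothetical mini dataframe with one numeric and one categorical feature
df_fe = pd.DataFrame({
    "Body Mass (g)": [3750.0, 3800.0, 5750.0],
    "Sex": ["MALE", "FEMALE", "MALE"],
})

# Replace the text column with one 0/1 column per category
encoded = pd.get_dummies(df_fe, columns=["Sex"], dtype=int)
print(encoded)
```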

### 6. Train model

In [28]:
from sklearn.dummy import DummyClassifier

In [29]:
m = DummyClassifier()  # baseline; can later be swapped for logistic regression, decision tree, random forest

In [30]:
# train the model
m.fit(X_train, y_train)

DummyClassifier()
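
Because scikit-learn classifiers share the same `fit` / `score` interface, swapping the baseline for another model is a one-line change. The sketch below assumes synthetic three-class data rather than the penguin data, and the hyperparameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem (assumption: stand-in for the penguin data)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

models = {
    "dummy": DummyClassifier(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Every model is trained and scored with exactly the same two calls
test_scores = {}
for name, model in models.items():
    model.fit(Xs_train, ys_train)
    test_scores[name] = model.score(Xs_test, ys_test)
    print(f"{name}: {test_scores[name]:.2f}")
```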

### 7. Optimize

* Hyperparameter optimization next week
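
As a brief preview, one common approach is a cross-validated grid search. The sketch below is an assumption about what such an optimization might look like, not the course's method. It uses synthetic data and tunes only `max_depth` of a decision tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption: stand-in for a real training set)
X_h, y_h = make_classification(n_samples=300, n_features=5, random_state=0)
Xh_train, Xh_test, yh_train, yh_test = train_test_split(
    X_h, y_h, test_size=0.2, random_state=0
)

# Try several tree depths, each evaluated with 5-fold cross-validation
# on the training data only; the test set stays untouched
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 5, None]},
    cv=5,
)
search.fit(Xh_train, yh_train)

print(search.best_params_, round(search.best_score_, 3))
```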

### 8. Calculate Test Score

In [31]:
# run this for your training dataset
m.score(X_train, y_train)

0.424812030075188

In [32]:
# run this for your test dataset
m.score(X_test, y_test)

0.4925373134328358
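
For classifiers, `score` returns the accuracy. In recent scikit-learn versions the default `DummyClassifier` strategy is `"prior"`, which always predicts the most frequent training class, so its accuracy equals that class's share of the labels. The sketch below checks this on a hypothetical ten-row label series; the strategy is passed explicitly so the behavior does not depend on the installed version.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 5 Adelie, 3 Gentoo, 2 Chinstrap
y_tr_demo = pd.Series(["Adelie"] * 5 + ["Gentoo"] * 3 + ["Chinstrap"] * 2)
# The dummy model ignores the features, so any placeholder column works
X_tr_demo = pd.DataFrame({"feature": range(len(y_tr_demo))})

baseline = DummyClassifier(strategy="prior").fit(X_tr_demo, y_tr_demo)
preds = baseline.predict(X_tr_demo)

score = baseline.score(X_tr_demo, y_tr_demo)          # accuracy via .score
manual = accuracy_score(y_tr_demo, preds)             # same number, computed explicitly
majority_share = y_tr_demo.value_counts(normalize=True).max()

print(score, manual, majority_share)
```

Any real model should beat this baseline score before it is worth comparing against the business goal.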

In [18]:
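# run this for the Kaggle test dataset; X_test_kaggle must be loaded first
# (it is not defined in this notebook, hence the NameError below)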
m.predict(X_test_kaggle)

NameError: name 'X_test_kaggle' is not defined

### 9. Submit to Kaggle