# Lesson 1

## 00:00:00 - Intro

## 00:01:41 - Setting up development environment

* Crestle gives you a Jupyter notebook for 3 cents an hour.
* Paperspace is another option.
* All course data is in Fast.ai repo under `fastai` > `courses` > `ml1`.

## 00:05:14 - Recommendations for watching video

* Watch the lecture first, then follow along with the video later (probably more useful to in-person students).

## 00:06:15 - Course approach

* Top-down approach: lots of practical work up front, then theory later.
* Course is a summary of 25 years of Jeremy's research - not a summary of other people's research.
* Chance to practise technical writing by authoring blog posts on stuff you learn.

## 00:08:08 - Importing libraries in Jupyter notebook

* The autoreload commands let you edit source code and have the changes immediately available in Jupyter.

In [1]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

In [1]:
from pathlib import Path

from fastai.imports import *
from fastai.structured import *

from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor
from IPython.display import display

from sklearn import metrics

## 00:08:42 - Why not follow Python code standards?

* The course code doesn't follow PEP 8.
* Basic idea: data science is not software engineering, even if the models eventually become software engineering projects.
  * Prototyping models requires thinking about some new paradigms.
* Can figure out where a function is from by putting its name into Jupyter:

In [4]:
display

<function IPython.core.display.display(*objs, include=None, exclude=None, metadata=None, transient=None, display_id=None, **kwargs)>

* One question mark shows the docs (`?display`); two question marks show the source (`??display`).
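
Outside Jupyter, roughly the same information is available from the standard library's `inspect` module. The sketch below is only an approximation of what `?` and `??` print, and it uses `json.dumps` as a stand-in for `display` so that it runs without IPython installed.

```python
import inspect
import json

# Where does the function live? (the module part of what Jupyter echoes back)
print(json.dumps.__module__)

# Roughly what `?json.dumps` shows: the signature and the docstring.
signature = inspect.signature(json.dumps)
doc = inspect.getdoc(json.dumps)
print(signature)
print(doc.splitlines()[0])

# Roughly what `??json.dumps` shows: the source code itself.
source = inspect.getsource(json.dumps)
print(source.splitlines()[0])
```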

## 00:12:08 - Kaggle competition: Blue Book for Bulldozers

* Kaggle competitions let you download a real-world dataset.
* Can submit to the leaderboard of old competitions.
  * No other way to know if you're competent at solving that type of problem.
* Machine learning can help us understand a dataset, not just make predictions from it.
* Downloading data:

In [2]:
PATH = Path('./data/bluebook')

In [6]:
PATH.mkdir(parents=True, exist_ok=True)

In [9]:
!kaggle competitions download -c bluebook-for-bulldozers --path={PATH}

Train.7z: Downloaded 7MB of 7MB to data/bluebook
Train.zip: Downloaded 9MB of 9MB to data/bluebook
Valid.7z: Downloaded 209KB of 209KB to data/bluebook
Valid.csv: Downloaded 3MB of 3MB to data/bluebook
Valid.zip: Downloaded 297KB of 297KB to data/bluebook
Data%20Dictionary.xlsx: Downloaded 11KB of 11KB to data/bluebook
median_benchmark.csv: Downloaded 192KB of 192KB to data/bluebook
Machine_Appendix.csv: Downloaded 49MB of 49MB to data/bluebook
ValidSolution.csv: Downloaded 316KB of 316KB to data/bluebook
TrainAndValid.7z: Downloaded 7MB of 7MB to data/bluebook
TrainAndValid.csv: Downloaded 114MB of 114MB to data/bluebook
TrainAndValid.zip: Downloaded 10MB of 10MB to data/bluebook
Test.csv: Downloaded 3MB of 3MB to data/bluebook
random_forest_benchmark_test.csv: Downloaded 207KB of 207KB to data/bluebook


* Can also download using the [copy as cURL](https://daniel.haxx.se/blog/2015/11/23/copy-as-curl/) command in Firefox.
  * Use curl's `-o` flag to specify the output file.

In [11]:
!ls {PATH}

Data%20Dictionary.xlsx           TrainAndValid.zip
Machine_Appendix.csv             Valid.7z
Test.csv                         Valid.csv
Train.7z                         Valid.zip
Train.zip                        ValidSolution.csv
TrainAndValid.7z                 median_benchmark.csv
TrainAndValid.csv                random_forest_benchmark_test.csv


### 00:24:32 - Audience questions

* Q1: What are the curly brackets?
* A1: They expand Python variables before the command is passed to the shell.
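
In plain Python, the closest equivalent is to fill the placeholders yourself and then hand the finished string to the shell. This is a minimal sketch of that idea, not how IPython implements it; `run_shell` is a hypothetical helper, and a temporary directory with one made-up file stands in for the real data folder.

```python
import subprocess
import tempfile
from pathlib import Path


def run_shell(template: str, **variables) -> str:
    """Fill {name} placeholders from `variables`, run the command, return stdout."""
    command = template.format(**variables)  # "ls {PATH}" becomes "ls /tmp/..."
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    return result.stdout


with tempfile.TemporaryDirectory() as tmp:
    PATH = Path(tmp)
    (PATH / "Train.csv").write_text("SalesID,SalePrice\n")
    # Comparable to `!ls {PATH}` in a notebook cell.
    listing = run_shell("ls {PATH}", PATH=PATH)

print(listing)
```

The sketch assumes a Unix-like shell where `ls` exists and a path without spaces; paths with spaces would need quoting.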

## 00:25:14 - Exploring the dataset

* The data is in CSV format. After unzipping, can use `head` to look at the first few lines:

In [16]:
!unzip {PATH}/Train.zip -d {PATH}

Archive:  data/bluebook/Train.zip
  inflating: data/bluebook/Train.csv  


In [17]:
!head {PATH}/Train.csv

SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,fiModelDesc,fiBaseModel,fiSecondaryDesc,fiModelSeries,fiModelDescriptor,ProductSize,fiProductClassDesc,state,ProductGroup,ProductGroupDesc,Drive_System,Enclosure,Forks,Pad_Type,Ride_Control,Stick,Transmission,Turbocharged,Blade_Extension,Blade_Width,Enclosure_Type,Engine_Horsepower,Hydraulics,Pushblock,Ripper,Scarifier,Tip_Control,Tire_Size,Coupler,Coupler_System,Grouser_Tracks,Hydraulics_Flow,Track_Type,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
1139246,66000,999089,3157,121,3,2004,68,Low,11/16/2006 0:00,521D,521,D,,,,Wheel Loader - 110.0 to 120.0 Horsepower,Alabama,WL,Wheel Loader,,EROPS w AC,None or Unspecified,,None or Unspecified,,,,,,,,2 Valve,,,,,None or Unspecified,None or Unspecified,,,,,,,,,,,,,Standard,Conventional
1139248,57000,117657,77,121,3,1996,464
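
Where the `head` command is not available, the same peek can be done in Python. The sketch below defines a hypothetical `head` helper and runs it on a tiny made-up file rather than the real `Train.csv`.

```python
import tempfile
from itertools import islice
from pathlib import Path


def head(path, n=10):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, newline="") as f:
        return [line.rstrip("\r\n") for line in islice(f, n)]


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"  # stand-in for Train.csv
    sample.write_text("SalesID,SalePrice\n1139246,66000\n1139248,57000\n")
    first_two = head(sample, n=2)

print(first_two)
```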

* Jeremy considers this data structured data (vs unstructured: images, audio).
  * NLP people use "structured data" to mean something else.
* Pandas is the most commonly used tool for dealing with structured data in Python.
  * Everyone uses the same abbreviation: `pd`.
* Can read a CSV file using the `read_csv` function.
  * Args:
    * `parse_dates` lists which columns contain dates.
    * `low_memory=False` makes pandas read more of the file before deciding what the column types are.

In [18]:
import pandas as pd

In [20]:
df_raw = pd.read_csv(f'{PATH}/Train.csv', low_memory=False, parse_dates=['saledate'])
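
To see what `parse_dates` changes, here is a small sketch that reads the same CSV text twice. The three-column CSV is a cut-down stand-in for `Train.csv`, and the raw date format of the second row is an assumption for illustration.

```python
import io

import pandas as pd

# A tiny stand-in for Train.csv with only three of its columns.
csv_text = (
    "SalesID,SalePrice,saledate\n"
    "1139246,66000,11/16/2006 0:00\n"
    "1139248,57000,3/26/2004 0:00\n"
)

# Without parse_dates, the date column is read as plain text.
plain = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, the named column is converted to datetimes on load.
parsed = pd.read_csv(
    io.StringIO(csv_text), low_memory=False, parse_dates=["saledate"]
)

print(plain["saledate"].dtype)
print(parsed["saledate"].dtype)
print(parsed["saledate"].dt.year.tolist())
```

Once the column holds datetimes, the `.dt` accessor exposes parts such as the year, which plain text columns do not offer.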

In [21]:
df_raw.head()

Unnamed: 0,SalesID,SalePrice,MachineID,ModelID,datasource,auctioneerID,YearMade,MachineHoursCurrentMeter,UsageBand,saledate,...,Undercarriage_Pad_Width,Stick_Length,Thumb,Pattern_Changer,Grouser_Type,Backhoe_Mounting,Blade_Type,Travel_Controls,Differential_Type,Steering_Controls
0,1139246,66000,999089,3157,121,3.0,2004,68.0,Low,2006-11-16,...,,,,,,,,,Standard,Conventional
1,1139248,57000,117657,77,121,3.0,1996,4640.0,Low,2004-03-26,...,,,,,,,,,Standard,Conventional
2,1139249,10000,434808,7009,121,3.0,2001,2838.0,High,2004-02-26,...,,,,,,,,,,
3,1139251,38500,1026470,332,121,3.0,2001,3486.0,High,2011-05-19,...,,,,,,,,,,
4,1139253,11000,1057373,17311,121,3.0,2007,722.0,Medium,2009-07-23,...,,,,,,,,,,


* Can see the last rows using the `tail()` method.
* If you have a lot of columns, it can be worth transposing the output with `transpose()` so that each column becomes a row:

In [23]:
df_raw.tail().transpose()

Unnamed: 0,401120,401121,401122,401123,401124
SalesID,6333336,6333337,6333338,6333341,6333342
SalePrice,10500,11000,11500,9000,7750
MachineID,1840702,1830472,1887659,1903570,1926965
ModelID,21439,21439,21439,21435,21435
datasource,149,149,149,149,149
auctioneerID,1,1,1,2,2
YearMade,2005,2005,2005,2005,2005
MachineHoursCurrentMeter,,,,,
UsageBand,,,,,
saledate,2011-11-02 00:00:00,2011-11-02 00:00:00,2011-11-02 00:00:00,2011-10-25 00:00:00,2011-10-25 00:00:00
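
A small sketch of what `transpose()` does, using a hand-built frame with three of the columns and two of the rows shown above. The variable names `wide` and `tall` are just for illustration.

```python
import pandas as pd

# Two rows, three columns: the usual "wide" layout.
wide = pd.DataFrame(
    {
        "SalesID": [6333336, 6333337],
        "SalePrice": [10500, 11000],
        "YearMade": [2005, 2005],
    },
    index=[401120, 401121],
)

# After transposing, column names become the row labels and vice versa.
tall = wide.transpose()  # `wide.T` is shorthand for the same thing

print(tall)
```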


* The value you want to predict (`SalePrice` in this example) is called the "dependent variable".
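
In code, the dependent variable is usually separated from the remaining columns before modelling. The sketch below assumes the common `X` / `y` naming convention, which these notes do not use themselves, and a three-row frame built from values in the `head()` output above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "SalePrice": [66000, 57000, 10000],
        "YearMade": [2004, 1996, 2001],
        "ModelID": [3157, 77, 7009],
    }
)

# y is the dependent variable: the column we want to predict.
y = df["SalePrice"]

# X holds every other column: the inputs the model may use.
X = df.drop(columns="SalePrice")

print(list(X.columns))
print(y.tolist())
```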

### 00:33:08 - Audience questions

* Q1: Aren't you at risk of overfitting if you spend too much time looking at the data?
* A1: Prefer "machine learning driven" exploratory data analysis.