# 30 Days of ML

---

### Day 8 - [Basic Data Exploration](https://www.kaggle.com/dansbecker/basic-data-exploration?utm_medium=email&utm_source=gamma&utm_campaign=thirty-days-of-ml&utm_content=day-8)

#### Reading a csv file using pandas
```
import pandas as pd

csv_file = "file.csv"
data = pd.read_csv(csv_file)
```

**Optional:**
1.   ```data.describe()``` -> to look at summary statistics of the numeric columns
2.   ```data.columns``` -> to look at the column names (this is an attribute, not a method)
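
A minimal sketch of both calls, assuming a small in-memory DataFrame in place of `file.csv` (the column names `rooms` and `price` are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for the contents of file.csv
data = pd.DataFrame({
    "rooms": [2, 3, 4, 3],
    "price": [100.0, 150.0, 210.0, 160.0],
})

# count, mean, std, min, quartiles and max for each numeric column
summary = data.describe()

# `columns` is an attribute, so it is accessed without parentheses
column_names = list(data.columns)

print(summary)
print(column_names)
```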

---

### Day 9 - [Your First Machine Learning Model](https://www.kaggle.com/dansbecker/your-first-machine-learning-model?utm_medium=email&utm_source=gamma&utm_campaign=thirty-days-of-ml&utm_content=day-9)

#### Getting the Prediction Target and Features
**NOTE:** By convention, the prediction target is ```y``` and the features are ```X```.

```
import pandas as pd

csv_file = "file.csv"
data = pd.read_csv(csv_file)

y = data.target
# Column names are strings; replace these placeholders with real column names
features = ["column1", "column2", "column3"]
X = data[features]
```

#### Building a Decision Tree Regressor Model using scikit-learn
```
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

csv_file = "file.csv"
data = pd.read_csv(csv_file)

y = data.target
# Column names are strings; replace these placeholders with real column names
features = ["column1", "column2", "column3"]
X = data[features]

model  = DecisionTreeRegressor(random_state = 1)
model.fit(X, y)
```
**NOTE:** Fitting means capturing patterns from the training data <br>
**Optional:** To check the predictions
```
print(X.head())
print(model.predict(X.head()))
```
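
The steps above can be tried end to end on toy data. This is only a sketch: the DataFrame and its column names (`rooms`, `area`, `price`) are invented for illustration and stand in for `file.csv`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data; every row is unique
data = pd.DataFrame({
    "rooms": [1, 2, 3, 4, 5, 6],
    "area": [30, 45, 60, 80, 95, 120],
    "price": [90, 130, 170, 220, 260, 330],
})

y = data["price"]              # prediction target
features = ["rooms", "area"]   # list of column names (strings)
X = data[features]             # feature matrix

model = DecisionTreeRegressor(random_state=1)
model.fit(X, y)

# An unrestricted tree memorizes unique training rows,
# so these predictions reproduce the first five targets
predictions = model.predict(X.head())
print(predictions)
```

Predicting on the same rows used for fitting says nothing about how the model handles new data, which is what the validation lesson addresses.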

---

### Day 9 - [Model Validation](https://www.kaggle.com/dansbecker/model-validation?utm_medium=email&utm_source=gamma&utm_campaign=thirty-days-of-ml&utm_content=day-9)
#### Calculating mean absolute error using scikit-learn
```
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

csv_file = "file.csv"
data = pd.read_csv(csv_file)

y = data.target
# Column names are strings; replace these placeholders with real column names
features = ["column1", "column2", "column3"]
X = data[features]

model  = DecisionTreeRegressor(random_state = 1)
model.fit(X, y)

predicted_targets = model.predict(X)
print(mean_absolute_error(y, predicted_targets))
```
#### Better version of calculating mean absolute error using scikit-learn (train_test_split)
```
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

csv_file = "file.csv"
data = pd.read_csv(csv_file)

y = data.target
# Column names are strings; replace these placeholders with real column names
features = ["column1", "column2", "column3"]
X = data[features]

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
model = DecisionTreeRegressor(random_state = 1)
model.fit(train_X, train_y)

val_predictions = model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
```
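
To see why the split matters, here is a rough comparison of the in-sample error and the validation error, assuming synthetic data (the columns `x1`, `x2` and the noise level are arbitrary choices, not from the lesson):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a linear signal plus noise
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.uniform(0, 10, 200), "x2": rng.uniform(0, 10, 200)})
y = 3 * X["x1"] + 2 * X["x2"] + rng.normal(0, 2, 200)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
model = DecisionTreeRegressor(random_state=1)
model.fit(train_X, train_y)

# Error on the rows the model was fitted on (overly optimistic)
in_sample_mae = mean_absolute_error(train_y, model.predict(train_X))
# Error on rows the model has never seen (more honest)
validation_mae = mean_absolute_error(val_y, model.predict(val_X))

print("In-sample MAE:", in_sample_mae)
print("Validation MAE:", validation_mae)
```

The unrestricted tree fits its own training rows essentially perfectly, so only the validation score reflects the noise in the data.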

---

### Day 10 - [Underfitting and Overfitting](https://www.kaggle.com/dansbecker/underfitting-and-overfitting?utm_medium=email&utm_source=gamma&utm_campaign=thirty-days-of-ml&utm_content=day-10)
**NOTE:** <br>
1. Overfitting is where a model matches the training data almost perfectly, but does poorly in validation and new data.
2. Underfitting is where a model fails to capture important patterns in the data.

#### Finding the `max_leaf_nodes` value with the lowest mean absolute error in a Decision Tree Regressor
```
def get_mae(leaf_nodes, train_X, val_X, train_y, val_y):
  model = DecisionTreeRegressor(max_leaf_nodes = leaf_nodes, random_state = 1)
  model.fit(train_X, train_y)
  val_predictions = model.predict(val_X)
  mae = mean_absolute_error(val_y, val_predictions)

  return mae

def get_best_mae(set_of_leaf_nodes, train_X, val_X, train_y, val_y):
  list_of_maes = []
  for leaf_nodes in set_of_leaf_nodes:
    list_of_maes.append(get_mae(leaf_nodes, train_X, val_X, train_y, val_y))

  # list.index is a method, so it is called with parentheses
  return set_of_leaf_nodes[list_of_maes.index(min(list_of_maes))]
```

The first function computes the validation mean absolute error for a given `max_leaf_nodes` value. The second function returns the candidate `max_leaf_nodes` value with the lowest mean absolute error.
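
The same search can also be written with a dictionary, as sketched below. The synthetic sine-shaped data and the candidate list `[2, 5, 25, 100]` are assumptions made only so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def get_mae(leaf_nodes, train_X, val_X, train_y, val_y):
    """Validation MAE of a tree limited to `leaf_nodes` leaves."""
    model = DecisionTreeRegressor(max_leaf_nodes=leaf_nodes, random_state=1)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))

# Synthetic data: a non-linear signal plus noise
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.uniform(0, 10, 300)})
y = np.sin(X["x1"]) * 10 + rng.normal(0, 1, 300)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Map each candidate tree size to its validation MAE
candidates = [2, 5, 25, 100]
scores = {n: get_mae(n, train_X, val_X, train_y, val_y) for n in candidates}

# The key with the smallest MAE is the best tree size
best_leaf_nodes = min(scores, key=scores.get)
print(scores)
print("Best max_leaf_nodes:", best_leaf_nodes)
```

Very small trees tend to underfit here, while very large ones start to chase the noise.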

---

### Day 10 - [Random Forests](https://www.kaggle.com/dansbecker/random-forests?utm_medium=email&utm_source=gamma&utm_campaign=thirty-days-of-ml&utm_content=day-10)
#### Building a Random Forest Regressor model instead of Decision Tree Regressor model using scikit-learn
```
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
 
csv_file = "file.csv"
data = pd.read_csv(csv_file)
 
y = data.target
# Column names are strings; replace these placeholders with real column names
features = ["column1", "column2", "column3"]
X = data[features]
 
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
model = RandomForestRegressor(random_state = 1)
model.fit(train_X, train_y)
 
val_predictions = model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
```

---

### Day 12 - [Missing Values](https://www.kaggle.com/alexisbcook/missing-values?utm_medium=email&utm_source=gamma&utm_campaign=thirty-days-of-ml&utm_content=day-12)

**Imputation** - fills in the missing values with some number, for example the mean value of each column <br>

#### Three approaches in dealing with missing values:
1. Approach 1 - Dropping the columns with missing values
2. Approach 2 - Imputation
3. Approach 3 - An extension to imputation
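
Before the full code, here is what each approach does to a tiny DataFrame. This is a sketch on made-up data (columns `a` and `b`); the indicator column is cast to float here only to keep the toy frame purely numeric.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value in column "a"
X_toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, 20.0, 30.0, 40.0],
})

# Approach 1: drop every column that has a missing value
cols_with_missing = [col for col in X_toy.columns if X_toy[col].isnull().any()]
dropped = X_toy.drop(cols_with_missing, axis=1)

# Approach 2: replace missing values with the column mean (SimpleImputer default)
imputed = pd.DataFrame(SimpleImputer().fit_transform(X_toy), columns=X_toy.columns)

# Approach 3: record which values were missing, then impute
X_plus = X_toy.copy()
for col in cols_with_missing:
    X_plus[col + "_was_missing"] = X_plus[col].isnull().astype(float)
imputed_plus = pd.DataFrame(SimpleImputer().fit_transform(X_plus), columns=X_plus.columns)

print(dropped)
print(imputed)
print(imputed_plus)
```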

#### Setup code
```
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("file.csv")

y = data.target

melb_predictors = data.drop(['target'], axis=1)
X = melb_predictors.select_dtypes(exclude=['object'])

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
```
#### Function code to measure quality of each approach
```
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
```

#### Sample code for approach 1
```
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))
```
#### Sample code for approach 2
```
from sklearn.impute import SimpleImputer

# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
```
#### Sample code for approach 3
```
# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))
```

---

### Day 12 - [Categorical Values](https://www.kaggle.com/alexisbcook/categorical-variables?utm_medium=email&utm_source=gamma&utm_campaign=thirty-days-of-ml&utm_content=day-12)

#### Three approaches to dealing with categorical values:
1. Approach 1 - Drop categorical values
2. Approach 2 - Ordinal Encoding
3. Approach 3 - One-hot encoding
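
To make the difference between the two encodings concrete, here is a small sketch on a made-up `color` column. The one-hot result is converted with `.toarray()` so the example does not depend on the encoder's dense-output argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data with a single categorical column
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Ordinal encoding: one integer per category (sorted: blue=0, green=1, red=2)
ordinal = OrdinalEncoder().fit_transform(toy[["color"]])

# One-hot encoding: one 0/1 column per category, in the same sorted order
one_hot = OneHotEncoder(handle_unknown="ignore").fit_transform(toy[["color"]]).toarray()

print(ordinal.ravel())
print(one_hot)
```

Ordinal encoding implies an ordering of the categories, which is only meaningful when the variable really is ordered. One-hot encoding makes no such assumption but adds one column per category.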

#### Getting the list of categorical values
```
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)
```

**NOTE**: The same function from the last lesson (`score_dataset`) is used to measure the quality of each approach

#### Sample code for approach 1
```
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])

print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))
```
#### Sample code for approach 2
```
from sklearn.preprocessing import OrdinalEncoder

# Make copy to avoid changing original data 
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])

print("MAE from Approach 2 (Ordinal Encoding):") 
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))
```
#### Sample code for approach 3
```
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
# (scikit-learn 1.2+ renames `sparse` to `sparse_output`; use the name your version accepts)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

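# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# Newer scikit-learn versions require all column names to be strings
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)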

print("MAE from Approach 3 (One-Hot Encoding):") 
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))
```

---

### Day 13 - [Pipelines](https://www.kaggle.com/alexisbcook/pipelines?utm_medium=email&utm_source=gamma&utm_campaign=thirty-days-of-ml&utm_content=day-13)

Pipelines are a simple way to keep data preprocessing and modeling code organized. Benefits of pipelines:
1. Cleaner code
2. Fewer bugs
3. Easier to productionize
4. More options for model validation 

#### Constructing the pipeline in three steps:

#### Step 1: Define preprocessing steps
This step imputes missing values in numerical data; for categorical data it imputes missing values and then applies one-hot encoding.
```
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
```
#### Step 2: Defining the model
```
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)
```
#### Step 3: Create and evaluate the pipeline
```
from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)
```
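
The three steps leave `numerical_cols`, `categorical_cols` and the data itself undefined. The sketch below runs them together on made-up data, assuming one numerical column (`size`) and one categorical column (`kind`), both invented for illustration. The pipeline fits the imputers and the encoder on the training rows only and reuses them on the validation rows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with some values knocked out
rng = np.random.default_rng(0)
n = 200
size = rng.uniform(20, 200, n)
kind = rng.choice(["house", "flat", "studio"], n).astype(object)
y = 2 * size + np.where(kind == "house", 50, 0) + rng.normal(0, 5, n)
X = pd.DataFrame({"size": size, "kind": kind})
X.loc[X.index[::10], "size"] = np.nan
X.loc[X.index[5::25], "kind"] = np.nan

numerical_cols = ["size"]
categorical_cols = ["kind"]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0)

# Step 1: preprocessing
numerical_transformer = SimpleImputer(strategy="constant")
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numerical_transformer, numerical_cols),
    ("cat", categorical_transformer, categorical_cols),
])

# Step 2: model
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Step 3: bundle, fit, predict, evaluate
my_pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", model)])
my_pipeline.fit(X_train, y_train)
preds = my_pipeline.predict(X_valid)
score = mean_absolute_error(y_valid, preds)
print("MAE:", score)
```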

---

### Day 13 - [Cross-Validation](https://www.kaggle.com/alexisbcook/cross-validation?utm_medium=email&utm_source=gamma&utm_campaign=thirty-days-of-ml&utm_content=day-13)

#### Example of cross-validation
```
from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print("MAE scores:\n", scores)
```
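
Each fold gives one score, so a single quality measure is usually the average across folds. A self-contained sketch, assuming synthetic data and a small stand-in pipeline (not the one built above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: three features, two of which drive the target
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 3))
y = X[:, 0] * 2 + X[:, 1] + rng.normal(0, 1, 100)

# Stand-in pipeline: imputation followed by a random forest
pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer()),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# One MAE per fold; sklearn returns negative MAE, hence the sign flip
scores = -1 * cross_val_score(pipeline, X, y, cv=5,
                              scoring="neg_mean_absolute_error")

# Average across folds as a single quality measure
average_mae = scores.mean()
print("MAE scores:", scores)
print("Average MAE:", average_mae)
```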

---

### Day 14 - [XGBoost](https://www.kaggle.com/alexisbcook/xgboost?utm_medium=email&utm_source=gamma&utm_campaign=thirty-days-of-ml&utm_content=day-14)

#### Sample Code
```
from xgboost import XGBRegressor as XGBR
my_model = XGBR(n_estimators = 100, learning_rate = 0.5, n_jobs = 4)
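# Depending on the XGBoost version, early_stopping_rounds may need to be
# passed to the XGBRegressor constructor instead of fit()
my_model.fit(X_train, y_train,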
            early_stopping_rounds = 5,
            eval_set = [(X_valid, y_valid)],
            verbose = False)
```

**n_estimators** --> specifies how many **times** to go through the modeling cycle; too **low** causes **underfitting** while too **high** will cause **overfitting**; typical values are from **100** to **1000**. <br>

**early_stopping_rounds** --> offers a way to automatically find the **ideal value** for n_estimators; typical value is **5**. <br>

**learning_rate** --> multiplies the predictions from each model by a small number before adding them in; a **smaller** learning rate and a **larger** number of estimators will generally yield more accurate XGBoost models, at the cost of longer training; typical values are small, such as **0.05** or **0.1**. <br>

**n_jobs** --> used to build models **faster**; typical value is **number of cores on machine**
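
To show how `n_estimators`, `learning_rate` and `early_stopping_rounds` interact, here is a hand-rolled boosting loop built from scikit-learn decision trees. It is a simplified illustration of the idea and not XGBoost's actual implementation; the `boost` function, the synthetic data and the tree depth are all assumptions of this sketch.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a non-linear signal plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) * 10 + rng.normal(0, 1, 300)
X_train, X_valid = X[:225], X[225:]
y_train, y_valid = y[:225], y[225:]

def boost(n_estimators, learning_rate, early_stopping_rounds):
    """Return (best_round, best_validation_mae) of a simple boosting loop."""
    # Start from a constant prediction: the training mean
    train_pred = np.full(len(y_train), y_train.mean())
    valid_pred = np.full(len(y_valid), y_train.mean())
    best_mae, best_round, rounds_since_best = np.inf, 0, 0
    for i in range(n_estimators):
        # Each new tree is fitted to the current residual errors
        tree = DecisionTreeRegressor(max_depth=2, random_state=0)
        tree.fit(X_train, y_train - train_pred)
        # Its predictions are scaled by the learning rate before being added
        train_pred += learning_rate * tree.predict(X_train)
        valid_pred += learning_rate * tree.predict(X_valid)
        mae = mean_absolute_error(y_valid, valid_pred)
        if mae < best_mae:
            best_mae, best_round, rounds_since_best = mae, i + 1, 0
        else:
            # Stop once the validation score has not improved for a while
            rounds_since_best += 1
            if rounds_since_best >= early_stopping_rounds:
                break
    return best_round, best_mae

baseline_mae = mean_absolute_error(y_valid, np.full(len(y_valid), y_train.mean()))
best_round, best_mae = boost(n_estimators=500, learning_rate=0.1,
                             early_stopping_rounds=5)
print("Baseline MAE:", baseline_mae)
print("Best round:", best_round, "MAE:", best_mae)
```

With early stopping, `n_estimators` acts as an upper bound, and the loop decides how many rounds are actually useful.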

---

### Day 14 - [Data Leakage](https://www.kaggle.com/alexisbcook/data-leakage?utm_medium=email&utm_source=gamma&utm_campaign=thirty-days-of-ml&utm_content=day-14)

#### Two types of data leakage:
1. Target Leakage - occurs when your predictors include data that will not be available at the time you make predictions
2. Train-Test Contamination - occurs when validation data influences preprocessing or training, so that training data gets mixed up with validation data

#### Sample code
```
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(my_pipeline, X, y, 
                            cv=5,
                            scoring='accuracy')
```

#### Dropping leaky predictors from dataset
```
potential_leaks = ['expenditure', 'share', 'active', 'majorcards']
X2 = X.drop(potential_leaks, axis=1)
```
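
A rough demonstration of target leakage, assuming synthetic data: the column `spend_after_approval` is a made-up predictor that is only non-zero once the outcome is known, so it would not be available at prediction time.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data: income weakly predicts the binary outcome
rng = np.random.default_rng(0)
n = 400
income = rng.normal(50, 10, n)
y = (income + rng.normal(0, 10, n) > 50).astype(int)

# Hypothetical leaky column: derived from the target itself
X = pd.DataFrame({
    "income": income,
    "spend_after_approval": y * rng.uniform(1, 100, n),
})

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Suspiciously high accuracy is a hint that something leaks
acc_with_leak = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

# Dropping the leaky predictor gives a more realistic estimate
X2 = X.drop(["spend_after_approval"], axis=1)
acc_without_leak = cross_val_score(model, X2, y, cv=5, scoring="accuracy").mean()

print("Accuracy with leaky column:", acc_with_leak)
print("Accuracy without leaky column:", acc_without_leak)
```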

---

### Day 15 - [Applying the lesson](https://www.kaggle.com/raimondextervinluan/getting-started-with-30-days-of-ml-competition/edit)

We will be dealing with a dataset that has no missing values but does have categorical values (ordinal, to be specific).

#### Step 1: Import the helpful libraries
```
# Familiar imports
import numpy as np
import pandas as pd

# For ordinal encoding categorical variables, splitting data
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split

# For training random forest model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

print("Libraries imported.")
```

#### Step 2: Load the data
```
# Load the training data
train = pd.read_csv("../input/30-days-of-ml/train.csv", index_col=0)
test = pd.read_csv("../input/30-days-of-ml/test.csv", index_col=0)

# Preview the data
train.head()
```
```
# Separate target from features
y = train['target']
features = train.drop(['target'], axis=1)

# Preview features
features.head()
```
#### Step 3: Prepare the data
```
# List of categorical columns
object_cols = [col for col in features.columns if 'cat' in col]

# ordinal-encode categorical columns
X = features.copy()
X_test = test.copy()
ordinal_encoder = OrdinalEncoder()
X[object_cols] = ordinal_encoder.fit_transform(features[object_cols])
X_test[object_cols] = ordinal_encoder.transform(test[object_cols])

# Preview the ordinal-encoded features
X.head()
```
```
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
```
#### Step 4: Training the model
```
# Define the model 
model = RandomForestRegressor(random_state=1)

# Train the model (will take about 10 minutes to run)
model.fit(X_train, y_train)
preds_valid = model.predict(X_valid)
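# squared=False returns the root mean squared error (RMSE);
# newer scikit-learn versions provide root_mean_squared_error for this instead
print(mean_squared_error(y_valid, preds_valid, squared=False))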
```
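
The validation score printed above is the root mean squared error. A version-independent way to compute it, sketched on three made-up values, is to take the square root of the mean squared error:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# Errors are 1, 0 and 2, so the mean squared error is (1 + 0 + 4) / 3
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse)
```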
#### Step 5: Submitting to the competition
```
# Use the model to generate predictions
predictions = model.predict(X_test)

# Save the predictions to a CSV file
output = pd.DataFrame({'Id': X_test.index,
                       'target': predictions})
output.to_csv('submission.csv', index=False)
```




#### Saving the output of the code to a CSV file
```
# Use the model to generate predictions
# (lgbm_model is assumed to be an already fitted model)
predictions = lgbm_model.predict(X_test)

# Save the predictions to a CSV file
output = pd.DataFrame({'Id': X_test.index,
                       'target': predictions})
output.to_csv('submission.csv', index=False)
```