# Data Scientist Professional Practical Exam Submission

**Use this template to write up your summary for submission. Code in Python or R needs to be included.**


## 📝 Task List

Your written report should include both code, output and written text summaries of the following:
- Data Validation:   
  - Describe validation and cleaning steps for every column in the data 
- Exploratory Analysis:  
  - Include two different graphics showing single variables only to demonstrate the characteristics of data  
  - Include at least one graphic showing two or more variables to represent the relationship between features
  - Describe your findings
- Model Development
  - Include your reasons for selecting the models you use as well as a statement of the problem type
  - Code to fit the baseline and comparison models
- Model Evaluation
  - Describe the performance of the two models based on an appropriate metric
- Business Metrics
  - Define a way to compare your model performance to the business
  - Describe how your models perform using this approach
- Final summary including recommendations that the business should undertake

*Start writing report here..*

# Load Modules and Data

## Load Modules and set Global Variables

In [1]:
# 0. Install external modules
!pip install fitter -qU
!pip install wandb -qU

[0m

In [1]:
# 1. Load modules
import pandas as pd
import numpy as np
import chardet as ch 
import missingno as msno
import random as rnd
import math

import matplotlib.pyplot as plt
import matplotlib.style as style
import scipy.special as sp
import seaborn as sns

from fitter import Fitter, get_common_distributions

import warnings
import wandb

from scipy import stats
from scipy.stats.mstats import winsorize
from imblearn.over_sampling import SMOTE
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.preprocessing import LabelEncoder, PowerTransformer, OrdinalEncoder, FunctionTransformer, StandardScaler
from sklearn.metrics import r2_score,mean_squared_error, f1_score, roc_auc_score, classification_report

from sklearn.linear_model import BayesianRidge, Ridge, LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostClassifier, RandomForestClassifier

ModuleNotFoundError: No module named 'imblearn'

In [3]:
plt.style.use('ggplot')
sns.set_context("notebook")
rnd.seed(42)
np.random.RandomState(42)

RandomState(MT19937) at 0x7F0E54D2EB40

In [3]:
# login to weights and biases to track models
wandb.login()

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

## Load Data

In [4]:
# check CSV file encoding to reduce reding errors and data cleanup
with open('recipe_site_traffic_2212.csv', 'rb') as file:             
    print(ch.detect(file.read()))

In [5]:
# load into pandas df
df = pd.read_csv('recipe_site_traffic_2212.csv', encoding = "ascii")     # load data into dataframe
df.info()

# Data Validation

## Inspect Raw Data

In [6]:
df.describe()

In [7]:
df.head()

In [8]:
df.tail()

In [9]:
df.sort_values(by=['calories'], ascending=False).head(20)

## Data Clean Up

Initially there are 947 rows, 8 columns. After validation, 835 rows remained
According to the data description, 
* recipe: Numeric, unique identifier of recipe
* calories: Numeric, number of calories
* carbohydrate: Numeric, amount of carbohydrates in grams
* sugar: Numeric, amount of sugar in grams
* protein: Numeric, amount of protein in grams
* category: Character, type of recipe. Recipes are listed in one of ten possible groupings (Lunch/Snacks', 'Beverages', 'Potato', 'Vegetable', 'Meat', 'Chicken, 'Pork', 'Dessert', 'Breakfast', 'One Dish Meal').
* servings: Numeric, number of servings for the recipe
* high_traffic: Character, if the traffic to the site was high when this recipe was shown, this is marked with “High”.

### Convert 'servings' Column to INT

In [10]:
df['servings'].unique()

In [11]:
df[df.servings.isin( ['4 as a snack', '6 as a snack'])]

In [12]:
df.servings = df.servings.str.replace('4 as a snack', '4') 
df.servings = df.servings.str.replace('6 as a snack', '6')

In [13]:
df.servings.unique()

In [14]:
convert_dtypes = {'servings':'int32'}
df = df.astype(convert_dtypes)
df.describe()

### Clean up 'category' and convert to categorical variable

In [15]:
df.category.unique()

In [16]:
df.category = df.category.str.replace('Chicken Breast', 'Chicken')

In [17]:
df.category = df.category.astype('category')

### Convert 'high_traffic' Column to BOOL

In [18]:
print(df.high_traffic.unique())
df.high_traffic = df.high_traffic.apply(lambda x: True if x == 'High' else False)

In [19]:
df.high_traffic.sum()

## Missing Values Treatment

### Inspect Missing Values

In [20]:
msno.matrix(df)

In [21]:
df[df.calories.isna() | df.carbohydrate.isna() | df.sugar.isna() | df.protein.isna()].sample(n=20)

### Replace missing values with median values by groups of category + servings

In [22]:
median_p_group = df.groupby(['category', 'servings'])['calories', 'carbohydrate', 'sugar', 'protein'].median()
median_p_group.reset_index(inplace=True)
median_p_group

In [23]:
df[df.calories.isna() | df.carbohydrate.isna() | df.sugar.isna() | df.protein.isna()]

In [24]:
cols_to_be_matched = ['category', 'servings']

In [25]:
recipe_w_missing = df[df.calories.isna() | df.carbohydrate.isna() | df.sugar.isna() | df.protein.isna()].recipe
df = df.set_index(cols_to_be_matched).combine_first(median_p_group.set_index(cols_to_be_matched)).reset_index()
df[df.recipe.isin(recipe_w_missing)]

### Drop Rows with null values in **all** numeric columns

In [26]:
df = df.dropna(subset=['calories', 'carbohydrate', 'sugar', 'protein'])
df.info()

### Find distribution of best fit to numeric columns

In [27]:
calories_dist = Fitter(df.calories, distributions= get_common_distributions())
calories_dist.fit()
calories_dist.summary()

In [28]:
carbohydrate_dist = Fitter(df.carbohydrate, distributions= get_common_distributions())
carbohydrate_dist.fit()
carbohydrate_dist.summary()

In [29]:
sugar_dist = Fitter(df.sugar, distributions= get_common_distributions())
sugar_dist.fit()
sugar_dist.summary()

In [30]:
protein_dist = Fitter(df.protein, distributions= get_common_distributions())
protein_dist.fit()
protein_dist.summary()

## Create a '_per_serving' column copies for 'calories', 'carbohydrate', 'sugar', 'protein' columns

In [31]:
cols_orig = ['calories', 'carbohydrate', 'sugar', 'protein']

In [32]:
cols_p_srv = [col + '_p_srv' for col in cols_orig]
for i in cols_orig: 
    df[i + '_p_srv'] = df.apply(lambda x: x[i] / x['servings'], axis=1)

### Convert numeric columns to log scale or other trasnformation to make them more normal

In [33]:
cols_log = [col + '_log' for col in cols_orig]
# apply log transformations to num columns
transformer = FunctionTransformer(np.log1p)
df[cols_log] = transformer.fit_transform(df[cols_orig])

In [34]:
fig, axes = plt.subplots(2,4,figsize=(20,7))
for i in range(4):
    sns.histplot(df[cols_orig[i]], ax=axes[0,i])
for i in range(4):
    sns.histplot(df[cols_log[i]], ax=axes[1,i])

### Apply Box-Cox transformation to normalize distributions  

In [35]:
cols_bc = [col + '_bc' for col in cols_orig]
lmbda = dict()
for i in cols_orig:
    df[i + '_bc'], lmbda[i] = stats.boxcox(df[i]+1)

In [36]:
fig, axes = plt.subplots(3,4,figsize=(20,7))
for i in range(4):
    sns.histplot(df[cols_orig[i]], ax=axes[0,i])
for i in range(4):
    sns.histplot(df[cols_bc[i]], ax=axes[1,i])
for i in range(4):
    sns.histplot(df[cols_log[i]], ax=axes[2,i])

In [37]:
lmbda

### Outlier treatment

In [38]:
# Count number of points outside of interquartile range
cols_input = cols_p_srv
Q1 = df[cols_input].quantile(0.25)
Q3 = df[cols_input].quantile(0.75)
IQR = Q3 - Q1
outliers = ((df[cols_input] < (Q1 - 1.5 * IQR)) | (df[cols_input] > (Q3 + 1.5 * IQR))).sum()
percentage = (((df[cols_input] < (Q1 - 1.5 * IQR)) | (df[cols_input] > (Q3 + 1.5 * IQR))).sum()*100/df.shape[0])
outliers_stats = pd.concat([outliers, percentage, Q1 - 1.5 * IQR, Q3 + 1.5 * IQR], axis = 1).rename(columns={0:'count', 1:'%', 2:'Q1 - 1.5 * IQR', 3:'Q3 + 1.5 * IQR'})
outliers_stats

In [39]:
# winsorize outliers
myVars = globals()
df_wo_outlrs = df[(df[cols_input] > (Q1 - 1.5 * IQR)) & (df[cols_input] < (Q3 + 1.5 * IQR))]
for i in cols_input:
    myVars[i + '_lims' ] = [df_wo_outlrs[i].min(), df_wo_outlrs[i].max()]

In [40]:
print([myVars[i + '_lims' ] for i in cols_input])

In [41]:
for i in cols_input:
    print(stats.percentileofscore(df[i], myVars[i + '_lims' ]))

In [42]:
for i in cols_input:
    percents = stats.percentileofscore(df[i], myVars[i + '_lims' ])/100
    df[i + '_wsrd'] = winsorize(df[i], [percents[0], 1-percents[1]] )

In [43]:
cols_wsrd = [i + '_wsrd' for i in cols_input]

# apply log transformations to num columns
transformer = FunctionTransformer(np.log1p)
df[[x + '_log' for x in cols_wsrd]] = transformer.fit_transform(df[cols_wsrd])

In [44]:
cols_wsrd_log =  [i + '_log' for i in cols_wsrd]

fig, axes = plt.subplots(2,4,figsize=(20,7))
for i in range(4):
    sns.histplot(df[cols_wsrd[i]], ax=axes[0,i])
for i in range(4):
    sns.histplot(df[cols_wsrd_log[i]], ax=axes[1,i])

### Review and validate data

In [45]:
df.sample(n=10)

In [46]:
df.info()

# Exploratory Analysis

In [47]:
target = ['high_traffic']
cols_cat = ['category', 'servings']
print(cols_orig)
print(cols_log)
print(cols_p_srv)
print(cols_bc)
print(cols_wsrd_log)

In [48]:
sns.histplot(df['calories'])

In [49]:
sns.histplot(df['high_traffic'])

In [50]:
sns.boxplot(data=df[cols_wsrd], orient="h")

In [51]:
sns.boxplot(data=df[cols_bc], orient="h")

In [52]:
sns.pairplot(df[cols_bc + ['category'] + target], hue='category')

In [53]:
fig = plt.figure(figsize=(20,15))
sns.boxplot(data=df, y="calories_bc", x="category", hue="high_traffic")

In [54]:
sns.violinplot(data=df[cols_bc])

In [55]:
g = sns.FacetGrid(df, col="high_traffic", row="category")
g.map_dataframe(sns.histplot, x="calories_log")

In [56]:
# Target Variable
fig, axes = plt.subplots(2,4,figsize=(20,7))
sns.histplot(df['calories'], ax = axes[0,0])
sns.histplot(df['carbohydrate'], ax = axes[0,1])
sns.histplot(df['sugar'], ax = axes[0,2])
sns.histplot(df['protein'], ax = axes[0,3])

# Model Fitting and Evaluation

In [57]:
df_model = df.copy()
cat_encode = LabelEncoder()
ord_encode = OrdinalEncoder()
cat_cols = ['servings', 'category']
ord_col = ['recipe']

df_model[cat_cols] = ord_encode.fit_transform(df_model[cat_cols]) 
df_model[ord_col] = ord_encode.fit_transform(df_model[ord_col]) 

In [58]:
target = ['high_traffic']
feature_cols = cols_bc + ['servings', 'category'] + cols_p_srv #+ cols_wsrd_log #, 'recipe']
X = df_model[feature_cols]           # Features
y = df_model[target]                 # Target variable

## Balancing classes of 'high_traffic'

In [59]:
smote = SMOTE()
X, y = smote.fit_resample(X, y)
df_model = pd.concat([pd.DataFrame(X), pd.DataFrame(y)], axis=1)

In [60]:
# define the scaler 
scaler = StandardScaler()
# fit and transform the train set
X[cols_bc] = scaler.fit_transform(X[cols_bc])
#X[cols_p_srv] = scaler.fit_transform(X[cols_p_srv])

In [61]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## Decision Tree Classifier

In [62]:
tc = DecisionTreeClassifier()
cv_score = cross_val_score(tc, X_train, y_train, cv=10)
tc.fit(X_train, y_train)

In [63]:
cv_score

In [64]:
y_pred = tc.predict(X_test)
print('Classification Report: ')
print(classification_report(y_test,y_pred))

In [65]:
tc.get_params()

In [66]:
# With Random Hyperparameter tuning
tc = DecisionTreeClassifier()
param_space = {'criterion': ['gini', 'entropy', 'log_loss'], 'ccp_alpha':[0, 0.01, 0.1, 1], 'max_features':['auto', 'sqrt', 'log2']}
stc = RandomizedSearchCV(tc, param_space, random_state=0)
search = stc.fit(X_train, y_train)
search.best_params_

In [67]:
y_pred = stc.predict(X_test)
print('Classification Report: ')
print(classification_report(y_test,y_pred))

In [68]:
stc.best_score_

In [69]:
model_params = stc.get_params()
#importances = stc.feature_importances_
model_params

{'cv': None,
 'error_score': nan,
 'estimator__ccp_alpha': 0.0,
 'estimator__class_weight': None,
 'estimator__criterion': 'gini',
 'estimator__max_depth': None,
 'estimator__max_features': None,
 'estimator__max_leaf_nodes': None,
 'estimator__min_impurity_decrease': 0.0,
 'estimator__min_samples_leaf': 1,
 'estimator__min_samples_split': 2,
 'estimator__min_weight_fraction_leaf': 0.0,
 'estimator__random_state': None,
 'estimator__splitter': 'best',
 'estimator': DecisionTreeClassifier(),
 'n_iter': 10,
 'n_jobs': None,
 'param_distributions': {'criterion': ['gini', 'entropy', 'log_loss'],
  'ccp_alpha': [0, 0.01, 0.1, 1],
  'max_features': ['auto', 'sqrt', 'log2']},
 'pre_dispatch': '2*n_jobs',
 'random_state': 0,
 'refit': True,
 'return_train_score': False,
 'scoring': None,
 'verbose': 0}

In [70]:
# start a new wandb run and add your model hyperparameters
wandb.init(project='datacamp-certification', config=model_params)

In [None]:
# Add additional configs to wandb
#wandb.config.update({"test_size" : test_size,
#                    "train_len" : len(X_train),
#                    "test_len" : len(X_test)})

# log additional visualisations to wandb
Xplot_class_proportions(y_train, y_test, labels)
plot_learning_curve(stc, X_train, y_train)
plot_roc(y_test, y_probas, labels)
plot_precision_recall(y_test, y_probas, labels)
plot_feature_importances(stc)

# [optional] finish the wandb run, necessary in notebooks
wandb.finish()

## Logistic Regression Classifier

In [69]:
lgc = LogisticRegression(random_state=0)
cv_score = cross_val_score(lgc, X_train, y_train, cv=10)
lgc.fit(X_train, y_train)

In [70]:
cv_score

array([0.79012346, 0.75308642, 0.69135802, 0.75      , 0.7625    ,
       0.8125    , 0.6875    , 0.7625    , 0.7       , 0.75      ])

In [71]:
y_pred = lgc.predict(X_test)
print('Classification Report: ')
print(classification_report(y_test,y_pred))

Classification Report: 
              precision    recall  f1-score   support

       False       0.70      0.86      0.77       159
        True       0.85      0.69      0.76       186

    accuracy                           0.77       345
   macro avg       0.78      0.77      0.77       345
weighted avg       0.78      0.77      0.77       345



In [72]:
lgc.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': 0,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [73]:
# With Random Hyperparameter tuning
lgc = LogisticRegression()
param_space = {'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
               'penalty':['none', 'l1', 'l2', 'elasticnet'],
               'C':[100, 10, 1.0, 0.1, 0.01]}
slgc = RandomizedSearchCV(lgc, param_space, random_state=0)
search = slgc.fit(X_train, y_train)
search.best_params_

{'solver': 'sag', 'penalty': 'l2', 'C': 0.01}

In [74]:
y_pred = slgc.predict(X_test)
print('Classification Report: ')
print(classification_report(y_test,y_pred))

Classification Report: 
              precision    recall  f1-score   support

       False       0.78      0.42      0.54       159
        True       0.64      0.90      0.75       186

    accuracy                           0.68       345
   macro avg       0.71      0.66      0.64       345
weighted avg       0.70      0.68      0.65       345



In [75]:
slgc.best_score_

0.6014673913043479

## Adaboost Classifier

In [76]:
abc = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_features='sqrt', criterion='gini', ccp_alpha=0.01))
abc.fit(X_train, y_train)

In [77]:
y_pred = abc.predict(X_test)
print('Classification Report: ')
print(classification_report(y_test,y_pred))

Classification Report: 
              precision    recall  f1-score   support

       False       0.71      0.82      0.76       159
        True       0.82      0.72      0.76       186

    accuracy                           0.76       345
   macro avg       0.77      0.77      0.76       345
weighted avg       0.77      0.76      0.76       345



In [78]:
abc.get_params()

{'algorithm': 'SAMME.R',
 'base_estimator__ccp_alpha': 0.01,
 'base_estimator__class_weight': None,
 'base_estimator__criterion': 'gini',
 'base_estimator__max_depth': None,
 'base_estimator__max_features': 'sqrt',
 'base_estimator__max_leaf_nodes': None,
 'base_estimator__min_impurity_decrease': 0.0,
 'base_estimator__min_samples_leaf': 1,
 'base_estimator__min_samples_split': 2,
 'base_estimator__min_weight_fraction_leaf': 0.0,
 'base_estimator__random_state': None,
 'base_estimator__splitter': 'best',
 'base_estimator': DecisionTreeClassifier(ccp_alpha=0.01, max_features='sqrt'),
 'learning_rate': 1.0,
 'n_estimators': 50,
 'random_state': None}

In [79]:
# With Random Hyperparameter tuning
abc = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_features='sqrt', criterion='gini', ccp_alpha=0.01))
param_space = {'learning_rate': np.linspace(0.0, 1.0, num=20) , 'n_estimators':[10, 50, 100, 500]}
sabc = RandomizedSearchCV(abc, param_space, random_state=0)
search = sabc.fit(X_train, y_train)
search.best_params_

{'n_estimators': 100, 'learning_rate': 0.3684210526315789}

In [80]:
y_pred = sabc.predict(X_test)
print('Classification Report: ')
print(classification_report(y_test,y_pred))

Classification Report: 
              precision    recall  f1-score   support

       False       0.70      0.83      0.76       159
        True       0.83      0.69      0.75       186

    accuracy                           0.76       345
   macro avg       0.76      0.76      0.76       345
weighted avg       0.77      0.76      0.76       345



## Random Forest Classifier

In [81]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

In [82]:
y_pred = rfc.predict(X_test)
print('Classification Report: ')
print(classification_report(y_test,y_pred))

Classification Report: 
              precision    recall  f1-score   support

       False       0.75      0.84      0.79       159
        True       0.85      0.76      0.80       186

    accuracy                           0.80       345
   macro avg       0.80      0.80      0.80       345
weighted avg       0.80      0.80      0.80       345



## ✅ When you have finished...
-  Publish your Workspace using the option on the left
-  Check the published version of your report:
	-  Can you see everything you want us to grade?
    -  Are all the graphics visible?
-  Review the grading rubric. Have you included everything that will be graded?
-  Head back to the [Certification Dashboard](https://app.datacamp.com/certification) to submit your practical exam report and record your presentation