# Analytics Vidhya - Guided Hackathon 2

# Problem Statement

As YouTube has become one of the most popular video-sharing platforms, being a YouTuber has emerged as a new type of career in recent years. YouTubers earn money through advertising revenue from YouTube videos, sponsorships from companies, merchandise sales, and donations from their fans. To maintain a stable income, the popularity of their videos becomes the top priority for YouTubers. Meanwhile, some of our friends are YouTubers or channel owners on other video-sharing platforms. This raised our interest in predicting the performance of a video. If creators have a preliminary prediction and understanding of their videos' performance, they may adjust their videos to gain the most attention from the public.

You are given details of videos along with a set of features. Can you accurately predict the number of likes for each video from these input variables?


**Refer to the dataset:** https://www.kaggle.com/jainpooja/av-guided-hackathon

**Load the libraries**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
%matplotlib inline
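# 'seaborn-dark' was renamed to 'seaborn-v0_8-dark' in matplotlib 3.6; use whichever name is available
plt.style.use('seaborn-v0_8-dark' if 'seaborn-v0_8-dark' in plt.style.available else 'seaborn-dark')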

import seaborn as sns

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import KFold

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression, ElasticNet

from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import warnings
warnings.simplefilter('ignore')
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

**Load the dataset**

In [None]:
ss = pd.read_csv('../input/av-guided-hackathon/sample_submission_cxCGjdN.csv')
train = pd.read_csv('../input/av-guided-hackathon/train.csv')
test = pd.read_csv('../input/av-guided-hackathon/test.csv')

In [None]:
ss.head(10)

In [None]:
train.head(3)

In [None]:
test.head(3)

**Identify target columns and feature variables**

In [None]:
ID_COL, TARGET_COL = 'video_id', 'likes'

In [None]:
print(f'\nTrain data contains {train.shape[0]} samples and {train.shape[1]} variables')
print(f'\nTest data contains {test.shape[0]} samples and {test.shape[1]} variables')

features = [c for c in train.columns if c not in [ID_COL, TARGET_COL]]
print(f'\nThe dataset contains total {len(features)} features')

**Let's check the target distribution as it is a regression problem**

In [None]:
_ = train[TARGET_COL].plot(kind = 'density', title = 'Likes Distribution', fontsize=14, figsize=(10, 6))

**As we can see, the target is highly right-skewed, so we will apply a log transformation**

In [None]:
_ = pd.Series(np.log1p(train[TARGET_COL])).plot(kind = 'density', title = 'Log Likes Distribution', fontsize=14, figsize=(10, 6))
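
A minimal sketch of why `np.log1p` is a natural choice here, using hypothetical like counts rather than the competition data: it compresses the long right tail, it is defined at zero, and `np.expm1` undoes it, which is how predictions are mapped back to like counts later in the notebook.

```python
import numpy as np

# Hypothetical right-skewed like counts (illustrative values, not from the dataset)
likes = np.array([0, 3, 12, 150, 4_000, 250_000], dtype=float)

log_likes = np.log1p(likes)      # log(1 + x), so a video with 0 likes maps to 0
recovered = np.expm1(log_likes)  # exp(x) - 1, the inverse of log1p

print("raw range:", likes.min(), "to", likes.max())
print("log range:", log_likes.min(), "to", round(log_likes.max(), 3))
print("round trip matches:", np.allclose(recovered, likes))
```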

**Let's check the datatype of all the columns**

In [None]:
train.info()

It is clear from the above output that there are no null values.

**Unique values in each variable**

In [None]:
train.nunique()

In [None]:
train.columns

In [None]:
num_cols = ['views', 'dislikes', 'comment_count']

**Univariate Analysis**

**Numeric Columns**

In [None]:
fig, axes = plt.subplots(3, 1, figsize=(8, 9))
for i, c in enumerate(num_cols):
  _ = train[[c]].boxplot(ax=axes[i], vert=False)

**Log Transformation of Numerical Columns**

In [None]:
for c in num_cols + ['likes']:
  train[c] = np.log1p(train[c]) 

In [None]:
fig, axes = plt.subplots(4, 1, figsize=(8, 9))
for i, c in enumerate(num_cols + ['likes']):
  _ = train[[c]].boxplot(ax=axes[i], vert=False)

**Bivariate Analysis**

**Correlation heatmaps**

In [None]:
plt.figure(figsize=(14, 8))
_ = sns.heatmap(train[num_cols + ['likes']].corr(), annot=True)

**Pair Plots**

In [None]:
_ = sns.pairplot(train[num_cols + ['likes']], height=5, aspect=24/16)

**Categorical Columns**

In [None]:
train.columns

In [None]:
train['channel_title'].nunique()

In [None]:
cat_cols = ['category_id', 'country_code', 'channel_title']
fig, axes = plt.subplots(1, 2, figsize=(24, 10))

for i, c in enumerate(['category_id', 'country_code']):
    _ = train[c].value_counts()[::-1].plot(kind = 'pie', ax=axes[i], title=c, autopct='%.0f', fontsize=18)
    _ = axes[i].set_ylabel('')
    
_ = plt.tight_layout()

In [None]:
sns.set(rc={'figure.figsize':(12.7, 8.27)})

top_20_channels = train['channel_title'].value_counts()[:20].reset_index()
top_20_channels.columns = ['channel_title', 'num_videos']

_ = sns.barplot(data = top_20_channels, y = 'channel_title', x = 'num_videos')
_ = plt.title("Top 20 Channels with maximum number of videos")

**Bivariate Analysis**

**Country wise number of videos for channels**

In [None]:
country_wise_channels = train.groupby(['country_code', 'channel_title']).size().reset_index()
country_wise_channels.columns = ['country_code', 'channel_title', 'num_videos']
country_wise_channels = country_wise_channels.sort_values(by = 'num_videos', ascending=False)
fig, axes = plt.subplots(4, 1, figsize=(10, 20))

for i, c in enumerate(train['country_code'].unique()):
  country = country_wise_channels[country_wise_channels['country_code'] == c][:10]
  _ = sns.barplot(x = 'num_videos', y = 'channel_title', data = country, ax = axes[i])
  _ = axes[i].set_title(f'Country Code {c}')

plt.tight_layout()

**CatPlots**

**Likes distribution per Category**

In [None]:
_ = sns.catplot(x="category_id", y="likes", data=train, height=5, aspect=24/8)

**Likes Distribution per country**

In [None]:
_ = sns.catplot(x="country_code", y="likes", data=train, height=5, aspect=24/8)

**Average likes per Country**

In [None]:
_ = train.groupby('country_code')['likes'].mean().sort_values().plot(kind = 'barh')

It looks like videos posted in Great Britain (country code GB) have a higher average number of likes (on the log scale) than videos posted in India.

**DateTime Variables**

In [None]:
train['publish_date'] = pd.to_datetime(train['publish_date'], format='%Y-%m-%d')
test['publish_date'] = pd.to_datetime(test['publish_date'], format='%Y-%m-%d')
train['publish_date']

In [None]:
train['publish_date'].min(), train['publish_date'].max()

In [None]:
train['publish_date'].dt.year.value_counts()

**Number of Videos in data datewise**

In [None]:
latest_data_train = train[train['publish_date'] > '2017-11']
latest_data_test = test[test['publish_date'] > '2017-11']
_ = latest_data_train.sort_values(by = 'publish_date').groupby('publish_date').size().rename('train').plot(figsize=(18, 6), title = 'Number of Videos')
_ = latest_data_test.sort_values(by = 'publish_date').groupby('publish_date').size().rename('test').plot(figsize=(18, 6), title = 'Number of Videos')
_ = plt.legend()

**Average likes in data sorted by date**

In [None]:
latest_data = train[train['publish_date'] > '2017-11']
_ = latest_data.sort_values(by = 'publish_date').groupby('publish_date')['likes'].mean().plot(figsize=(18, 6), title="Mean Likes")

**Number of videos by country**

In [None]:
tmp = latest_data.groupby(['publish_date', 'country_code']).size().reset_index()
_ = tmp.pivot_table(index = 'publish_date', columns = 'country_code', values=0).plot(subplots=True, figsize=(20, 20),
                                                                                           title='Number of Videos by country',
                                                                                           sharex=False,
                                                                                           fontsize=20)
plt.tight_layout()

**Average number of likes by country, ordered by date**

In [None]:
tmp = latest_data.groupby(['publish_date', 'country_code'])['likes'].mean().reset_index()
_ = tmp.pivot_table(index = 'publish_date', columns = 'country_code', values='likes').plot(subplots=True, figsize=(20,20),
                                                                                           title='Average Number of Likes by country',
                                                                                           sharex=False,
                                                                                           fontsize=20)
plt.tight_layout()

**Analyze Textual Data**

In [None]:
text_cols = ['title', 'tags', 'description']

from wordcloud import WordCloud, STOPWORDS

wc = WordCloud(stopwords = set(list(STOPWORDS) + ['|']), random_state = 42)
fig, axes = plt.subplots(2, 2, figsize=(20, 12))
axes = [ax for axes_row in axes for ax in axes_row]

for i, c in enumerate(text_cols):
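  # str(Series) yields only a truncated preview, so join every row into one string instead
  op = wc.generate(' '.join(train[c].astype(str)))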
  _ = axes[i].imshow(op)
  _ = axes[i].set_title(c.upper(), fontsize=24)
  _ = axes[i].axis('off')

_ = fig.delaxes(axes[3])

**Top words in highly liked YouTube videos, by country**

In [None]:
def plot_countrywise(country_code = 'IN'):
  country = train[train['country_code'] == country_code]
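  # 'likes' is log1p-transformed at this point, so > 10 keeps videos with more than roughly 22,000 likes
  country = country[country['likes'] > 10]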
  fig, axes = plt.subplots(2, 2, figsize=(20, 12))
  axes = [ax for axes_row in axes for ax in axes_row]

  for i, c in enumerate(text_cols):
    # Join every row into one string; str(Series) would only give a truncated preview
    op = wc.generate(' '.join(country[c].astype(str)))
    _ = axes[i].imshow(op)
    _ = axes[i].set_title(c.upper(), fontsize=24)
    _ = axes[i].axis('off')

  fig.delaxes(axes[3])
  _ = plt.suptitle(f"Country Code: '{country_code}'", fontsize=30)

In [None]:
plot_countrywise("US")

In [None]:
plot_countrywise("GB")

In [None]:
plot_countrywise("IN")

In [None]:
plot_countrywise("CA")

In [None]:
train.head(2)

**Helper Function to Download Test Predictions as CSV**

In [None]:
def download_preds(preds_test, file_name = 'hacklive_sub.csv'):

  ## 1. Setting the target column with our obtained predictions
  ss[TARGET_COL] = preds_test

  ## 2. Saving our predictions to a csv file

  ss.to_csv(file_name, index = False)

In [None]:
ss = pd.read_csv('../input/av-guided-hackathon/sample_submission_cxCGjdN.csv')
train = pd.read_csv('../input/av-guided-hackathon/train.csv')
test = pd.read_csv('../input/av-guided-hackathon/test.csv')

**Segregate different types of columns**

In [None]:
num_cols = ['views', 'dislikes', 'comment_count']
cat_cols = ['category_id', 'country_code']
text_cols = ['title', 'channel_title', 'tags', 'description']
date_cols = ['publish_date']

**Concatenate train and test data**

In [None]:
train.shape, test.shape

In [None]:
df = pd.concat([train, test], axis=0).reset_index(drop = True)
df.shape

In [None]:
df.head(2)

**One hot Encoding on Categorical columns**

In [None]:
df = pd.get_dummies(df, columns = cat_cols)
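
A small sketch of why train and test are concatenated before one-hot encoding, using two hypothetical toy frames: encoding them separately can produce different column sets when a category appears in only one of them, whereas encoding the combined frame guarantees identical columns.

```python
import pandas as pd

# Hypothetical toy frames; 'GB' appears only in the test set, 'IN' only in train
train_toy = pd.DataFrame({'country_code': ['IN', 'US', 'IN']})
test_toy = pd.DataFrame({'country_code': ['GB', 'US']})

# Encoding separately: the two frames end up with different columns
separate_train = pd.get_dummies(train_toy, columns=['country_code'])
separate_test = pd.get_dummies(test_toy, columns=['country_code'])
print(list(separate_train.columns), list(separate_test.columns))

# Encoding after concatenation: both halves share the same columns
combined = pd.concat([train_toy, test_toy], axis=0).reset_index(drop=True)
combined = pd.get_dummies(combined, columns=['country_code'])
train_enc = combined[:len(train_toy)]
test_enc = combined[len(train_toy):].reset_index(drop=True)
print(list(train_enc.columns))
```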

In [None]:
df = df.fillna(-999)
df.isnull().sum().sum()

In [None]:
df[num_cols + ['likes']] = df[num_cols + ['likes']].apply(lambda x: np.log1p(x))

In [None]:
df['likes']

In [None]:
df.head(2)

**Split data into train and test**

In [None]:
train_proc, test_proc = df[:train.shape[0]], df[train.shape[0]:].reset_index(drop = True)
features = [c for c in train_proc.columns if c not in [ID_COL, TARGET_COL]]

**Split train data into train and validation sets**

In [None]:
trn, val = train_test_split(train_proc, test_size=0.2, random_state = 420)

###### Input to our model will be the features
X_trn, X_val = trn[features], val[features]

###### Output of our model will be the TARGET_COL
y_trn, y_val = trn[TARGET_COL], val[TARGET_COL]

##### Features for the test data that we will be predicting
X_test = test_proc[features]

**Metrics to check results on the validation set after training the model**

In [None]:
from sklearn.metrics import mean_squared_error, mean_squared_log_error

def rmsle(y_true, y_pred):
  return np.sqrt(mean_squared_log_error(y_true, y_pred))

def av_metric(y_true, y_pred):
  return 1000 * np.sqrt(mean_squared_error(y_true, y_pred))
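
Because the target was already passed through `log1p`, applying `av_metric` to log-scale values appears to amount to 1000 times the RMSLE on raw like counts. The sketch below checks this with plain NumPy on hypothetical values; the helper names `rmse` and `rmsle_raw` are illustrative and not part of the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def rmsle_raw(y_true, y_pred):
    # Root mean squared log error, computed on raw (untransformed) counts
    return rmse(np.log1p(y_true), np.log1p(y_pred))

# Hypothetical raw like counts and predictions
raw_true = np.array([10.0, 200.0, 3000.0])
raw_pred = np.array([12.0, 180.0, 3500.0])

score_on_logs = 1000 * rmse(np.log1p(raw_true), np.log1p(raw_pred))
score_on_raw = 1000 * rmsle_raw(raw_true, raw_pred)

print(score_on_logs, score_on_raw)
```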

**Model on Numerical and Categorical columns**

In [None]:
features = [c for c in X_trn.columns if c not in [ID_COL, TARGET_COL]]
cat_num_cols = [c for c in features if c not in text_cols + date_cols]

In [None]:
clf = LinearRegression()

_ = clf.fit(X_trn[cat_num_cols], y_trn)

preds_val = clf.predict(X_val[cat_num_cols])

av_metric_score = av_metric(y_val, preds_val)

print(f'AV metric score is: {av_metric_score}')

In [None]:
clf = DecisionTreeRegressor(random_state=420)

_ = clf.fit(X_trn[cat_num_cols], y_trn)

preds_val = clf.predict(X_val[cat_num_cols])

av_metric_score = av_metric(y_val, preds_val)

print(f'AV metric score is: {av_metric_score}')

In [None]:
regr = RandomForestRegressor(max_depth=6, random_state=42)
_ = regr.fit(X_trn[cat_num_cols], y_trn)

preds_val = regr.predict(X_val[cat_num_cols])

av_metric_score = av_metric(y_val, preds_val)

print(f'AV metric score is: {av_metric_score}')

**HyperParameter Tuning with Random Search CV**

In [None]:

from sklearn.model_selection import RandomizedSearchCV

n_estimators = [int(x) for x in np.linspace(start = 20, stop = 200, num = 5)]
# Number of features to consider at every split
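# 'auto' (all features, for regressors) was removed in scikit-learn 1.3; 1.0 is the equivalent setting
max_features = [1.0, 'sqrt']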
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(1, 45, num = 3)]
# Minimum number of samples required to split a node
min_samples_split = [5, 10]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split}

print(random_grid)



In [None]:
forest = RandomForestRegressor(n_jobs=-1)
rf_random = RandomizedSearchCV(estimator = forest, param_distributions = random_grid, n_iter = 10, cv = 10, verbose=2, random_state=42, n_jobs = -1, scoring='neg_mean_squared_error')
# Fit the random search model
search = rf_random.fit(train_proc[cat_num_cols], train_proc[TARGET_COL])
search.best_params_
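
To make the cost of the search concrete, here is a rough standard-library sketch of what a randomized search over a grid of this shape does: it enumerates the candidate combinations and evaluates only `n_iter` of them. The sampling below only illustrates the idea and is not scikit-learn's internal implementation, so the sampled combinations will not match those of `RandomizedSearchCV`.

```python
import itertools
import random

# A grid with the same shape as random_grid above (values written out by hand)
grid = {
    'n_estimators': [20, 65, 110, 155, 200],
    'max_features': [1.0, 'sqrt'],   # 1.0 means "use all features"
    'max_depth': [1, 23, 45],
    'min_samples_split': [5, 10],
}

# Every combination an exhaustive grid search would have to evaluate
all_combos = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]

# A randomized search evaluates only n_iter of them
n_iter = 10
rng = random.Random(42)
sampled = rng.sample(all_combos, n_iter)

print(f'{len(all_combos)} combinations in the grid, {len(sampled)} evaluated')
```

With 10-fold cross-validation, 10 sampled candidates mean 100 model fits instead of the 600 a full grid search would need.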

In [None]:
best_params = {'n_estimators': 155,
 'min_samples_split': 5,
 'max_features': 'sqrt',
 'max_depth': 23}

In [None]:
regr = RandomForestRegressor(**best_params)
_ = regr.fit(X_trn[cat_num_cols], y_trn)

preds_val = regr.predict(X_val[cat_num_cols])

av_metric_score = av_metric(y_val, preds_val)

print(f'AV metric score is: {av_metric_score}')

In [None]:
# Use the tuned random forest (regr); clf still refers to the earlier decision tree
preds_test = regr.predict(X_test[cat_num_cols])

preds_test = np.expm1(preds_test)

download_preds(preds_test, 'regr_num_cat.csv')

**Using Stratified K-Fold Validation**

**Helper Function to run Stratified K-Fold**

In [None]:
pd.qcut(np.arange(10), 5, labels = False, duplicates='drop')
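
`StratifiedKFold` needs discrete labels, so a continuous target has to be binned first. A brief sketch, on a hypothetical synthetic target, of how `pd.qcut` turns a skewed continuous variable into roughly equal-sized quantile bins that can serve as stratification labels:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed target, standing in for the like counts
rng = np.random.default_rng(0)
target = pd.Series(rng.lognormal(mean=5, sigma=2, size=1000))

# Ten quantile bins; labels=False returns the integer bin index per row
bins = pd.qcut(target, 10, labels=False, duplicates='drop')

print(bins.value_counts().sort_index())
```

Each bin holds about a tenth of the rows, so stratifying on `bins` keeps the target distribution similar across folds.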

In [None]:
from sklearn.model_selection import StratifiedKFold
def run_clf_kfold(clf, train, test, features):

  N_SPLITS = 5

  oofs = np.zeros(len(train))
  preds = np.zeros((len(test)))

  target = train[TARGET_COL]

  folds = StratifiedKFold(n_splits = N_SPLITS)
  stratified_target = pd.qcut(train[TARGET_COL], 10, labels = False, duplicates='drop')

  for fold_, (trn_idx, val_idx) in enumerate(folds.split(train, stratified_target)):
    print(f'\n------------- Fold {fold_ + 1} -------------')

    ############# Get train, validation and test sets along with targets ################
  
    ### Training Set
    X_trn, y_trn = train[features].iloc[trn_idx], target.iloc[trn_idx]

    ### Validation Set
    X_val, y_val = train[features].iloc[val_idx], target.iloc[val_idx]

    ### Test Set
    X_test = test[features]

    ############# Scaling Data ################
    scaler = StandardScaler()
    _ = scaler.fit(X_trn)

    X_trn = scaler.transform(X_trn)
    X_val = scaler.transform(X_val)
    X_test = scaler.transform(X_test)


    ############# Fitting and Predicting ################

    _ = clf.fit(X_trn, y_trn)

    ### Predict the (log-scale) target for the validation and test sets
    preds_val = clf.predict(X_val)
    preds_test = clf.predict(X_test)

    fold_score = av_metric(y_val, preds_val)
    print(f'\nAV metric score for validation set is {fold_score}')

    oofs[val_idx] = preds_val
    preds += preds_test / N_SPLITS


  oofs_score = av_metric(target, oofs)
  print(f'\n\nAV metric for oofs is {oofs_score}')

  return oofs, preds
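
The out-of-fold (OOF) bookkeeping in the helper above can be sketched with NumPy alone. The example below uses a hypothetical stand-in "model" that simply predicts the mean of its training fold, and plain contiguous folds rather than stratified ones, to show that every training row receives exactly one OOF prediction and that test predictions are averaged over the folds.

```python
import numpy as np

n_train, n_test, n_splits = 10, 4, 5
y = np.arange(n_train, dtype=float)   # hypothetical target values 0..9

oofs = np.zeros(n_train)              # one out-of-fold prediction per training row
preds = np.zeros(n_test)              # test predictions, averaged over folds
times_filled = np.zeros(n_train, dtype=int)

for val_idx in np.array_split(np.arange(n_train), n_splits):
    trn_idx = np.setdiff1d(np.arange(n_train), val_idx)

    # Stand-in "model": predict the mean of the training part of this fold
    fold_prediction = y[trn_idx].mean()

    oofs[val_idx] = fold_prediction
    preds += np.full(n_test, fold_prediction) / n_splits
    times_filled[val_idx] += 1

print(oofs)
print(preds)
```

Since each OOF value comes from a model that never saw that row, the OOF score is an honest estimate of generalisation, and the OOF columns can later be used as inputs to an ensemble.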

**K-Fold on Random Forest**

In [None]:
rf_params = best_params = {'n_estimators': 155,
 'min_samples_split': 5,
 'max_features': 'sqrt',
 'max_depth': 23}

clf = RandomForestRegressor(**rf_params)
        

rf_oofs, rf_preds = run_clf_kfold(clf, train_proc, test_proc, cat_num_cols)

**Boosting Algorithm**

In [None]:
def run_gradient_boosting(clf, fit_params, train, test, features):
  N_SPLITS = 5
  # Size the arrays from the function arguments rather than the global train_proc/test_proc
  oofs = np.zeros(len(train))
  preds = np.zeros((len(test)))

  target = train[TARGET_COL]

  folds = StratifiedKFold(n_splits = N_SPLITS)
  stratified_target = pd.qcut(train[TARGET_COL], 10, labels = False, duplicates='drop')

  feature_importances = pd.DataFrame()

  for fold_, (trn_idx, val_idx) in enumerate(folds.split(train, stratified_target)):
    print(f'\n------------- Fold {fold_ + 1} -------------')

    ### Training Set
    X_trn, y_trn = train[features].iloc[trn_idx], target.iloc[trn_idx]

    ### Validation Set
    X_val, y_val = train[features].iloc[val_idx], target.iloc[val_idx]

    ### Test Set
    X_test = test[features]

    scaler = StandardScaler()
    _ = scaler.fit(X_trn)

    X_trn = scaler.transform(X_trn)
    X_val = scaler.transform(X_val)
    X_test = scaler.transform(X_test)
    
    _ = clf.fit(X_trn, y_trn, eval_set = [(X_val, y_val)], **fit_params)

    fold_importance = pd.DataFrame({'fold': fold_ + 1, 'feature': features, 'importance': clf.feature_importances_})
    feature_importances = pd.concat([feature_importances, fold_importance], axis=0)

    ### Predict the (log-scale) target for the validation and test sets
    preds_val = clf.predict(X_val)
    preds_test = clf.predict(X_test)

    fold_score = av_metric(y_val, preds_val)
    print(f'\nAV metric score for validation set is {fold_score}')

    oofs[val_idx] = preds_val
    preds += preds_test / N_SPLITS


  oofs_score = av_metric(target, oofs)
  print(f'\n\nAV metric for oofs is {oofs_score}')

  feature_importances = feature_importances.reset_index(drop = True)
  fi = feature_importances.groupby('feature')['importance'].mean().sort_values(ascending = False)[:20][::-1]
  fi.plot(kind = 'barh', figsize=(12, 6))

  return oofs, preds, fi
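
The feature-importance aggregation at the end of the helper can be illustrated in isolation. This sketch uses made-up importance values for two folds and averages them per feature, in the same way as the `groupby(...).mean()` call above.

```python
import pandas as pd

# Hypothetical per-fold importances (values are illustrative only)
fold_frames = [
    pd.DataFrame({'fold': 1, 'feature': ['views', 'dislikes'], 'importance': [60.0, 40.0]}),
    pd.DataFrame({'fold': 2, 'feature': ['views', 'dislikes'], 'importance': [70.0, 30.0]}),
]

feature_importances = pd.concat(fold_frames, axis=0).reset_index(drop=True)

# Average across folds, most important feature first
fi = feature_importances.groupby('feature')['importance'].mean().sort_values(ascending=False)
print(fi)
```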

In [None]:
clf = CatBoostRegressor(n_estimators = 3000,
                       learning_rate = 0.05,
                       rsm = 0.4, ## Analogous to colsample_bytree
                       random_state=420,
                       )

fit_params = {'verbose': 200, 'early_stopping_rounds': 200}

cb_oofs, cb_preds, fi = run_gradient_boosting(clf, fit_params, train_proc, test_proc, cat_num_cols)

**Feature Engineering**

**Helper Function**

In [None]:
def join_df(train, test):

  df = pd.concat([train, test], axis=0).reset_index(drop = True)
  features = [c for c in df.columns if c not in [ID_COL, TARGET_COL]]
  df[num_cols + ['likes']] = df[num_cols + ['likes']].apply(lambda x: np.log1p(x))

  return df, features

def split_df_and_get_features(df, train_nrows):

  train, test = df[:train_nrows].reset_index(drop = True), df[train_nrows:].reset_index(drop = True)
  features = [c for c in train.columns if c not in [ID_COL, TARGET_COL]]
  
  return train, test, features

In [None]:
df, features = join_df(train, test)

In [None]:
cat_cols = ['category_id', 'country_code', 'channel_title']

In [None]:
### Label Encoding

df[cat_cols] = df[cat_cols].apply(lambda x: pd.factorize(x)[0])
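
For reference, a short sketch of what `pd.factorize` returns on a hypothetical column: integer codes assigned in order of first appearance, plus the array of unique values. By default, missing values receive the code -1.

```python
import pandas as pd

# Hypothetical categorical column with one missing value
country = pd.Series(['IN', 'US', None, 'IN', 'GB'])

codes, uniques = pd.factorize(country)

print(codes)     # integer code per row
print(uniques)   # unique values, in order of first appearance
```

Tree-based models can usually work with such integer codes directly, although the implied ordering has no meaning.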

In [None]:
df['publish_date'] = pd.to_datetime(df['publish_date'], format='%Y-%m-%d')
df['publish_date_days_since_start'] = (df['publish_date'] - df['publish_date'].min()).dt.days

df['publish_date_day_of_week'] = df['publish_date'].dt.dayofweek
df['publish_date_year'] = df['publish_date'].dt.year
df['publish_date_month'] = df['publish_date'].dt.month
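
The same date features on three hypothetical dates, to show what each column contains. `dayofweek` counts from Monday = 0.

```python
import pandas as pd

# Hypothetical publish dates
dates = pd.to_datetime(pd.Series(['2017-11-13', '2017-11-14', '2017-12-01']), format='%Y-%m-%d')

days_since_start = (dates - dates.min()).dt.days   # days elapsed since the earliest date
day_of_week = dates.dt.dayofweek                   # Monday = 0, ..., Sunday = 6
month = dates.dt.month

print(pd.DataFrame({'date': dates, 'days_since_start': days_since_start,
                    'day_of_week': day_of_week, 'month': month}))
```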

In [None]:
features = [c for c in df.columns if c not in [ID_COL, TARGET_COL]]
cat_num_cols = [c for c in features if c not in ['title', 'tags', 'description', 'publish_date']]

In [None]:
cat_num_cols

In [None]:
train_proc, test_proc, features = split_df_and_get_features(df, train.shape[0])

In [None]:
clf = CatBoostRegressor(n_estimators = 3000,
                       learning_rate = 0.05,
                       rsm = 0.4, ## Analogous to colsample_bytree
                       random_state=420,
                       )

fit_params = {'verbose': 200, 'early_stopping_rounds': 200}

cb_oofs, cb_preds, fi = run_gradient_boosting(clf, fit_params, train_proc, test_proc, cat_num_cols)

In [None]:
df['channel_title_num_videos'] = df['channel_title'].map(df['channel_title'].value_counts())
df['publish_date_num_videos'] = df['publish_date'].map(df['publish_date'].value_counts())
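
These two columns are frequency (count) encodings: each row receives the number of rows that share its value. A minimal sketch on a hypothetical channel column:

```python
import pandas as pd

# Hypothetical channel names
channel = pd.Series(['a', 'b', 'a', 'c', 'a'])

# value_counts() gives a value -> count lookup; map() applies it row by row
num_videos = channel.map(channel.value_counts())

print(num_videos.tolist())
```

Note that in the notebook the counts are computed on the concatenated train and test data, so they reflect both sets.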

In [None]:
train_proc, test_proc, features = split_df_and_get_features(df, train.shape[0])
features = [c for c in df.columns if c not in [ID_COL, TARGET_COL]]
cat_num_cols = [c for c in features if c not in ['title', 'tags', 'description', 'publish_date']]

In [None]:
cat_num_cols

In [None]:
clf = CatBoostRegressor(n_estimators = 3000,
                       learning_rate = 0.05,
                       rsm = 0.4, ## Analogous to colsample_bytree
                       random_state=420,
                       )

fit_params = {'verbose': 200, 'early_stopping_rounds': 200}

cb_oofs, cb_preds, fi = run_gradient_boosting(clf, fit_params, train_proc, test_proc, cat_num_cols)

In [None]:
df['channel_in_n_countries'] = df.groupby('channel_title')['country_code'].transform('nunique')
df['channel_in_n_countries'].unique()

In [None]:
train_proc, test_proc, features = split_df_and_get_features(df, train.shape[0])
features = [c for c in df.columns if c not in [ID_COL, TARGET_COL]]
cat_num_cols = [c for c in features if c not in ['title', 'tags', 'description', 'publish_date']]

In [None]:
clf = CatBoostRegressor(n_estimators = 3000,
                       learning_rate = 0.05,
                       rsm = 0.4, ## Analogous to colsample_bytree
                       random_state=420,
                       )

fit_params = {'verbose': 200, 'early_stopping_rounds': 200}

cb_oofs, cb_preds, fi = run_gradient_boosting(clf, fit_params, train_proc, test_proc, cat_num_cols)

**Grouping Features**

In [None]:
df['channel_title_mean_views'] = df.groupby('channel_title')['views'].transform('mean')
df['channel_title_max_views'] = df.groupby('channel_title')['views'].transform('max')
df['channel_title_min_views'] = df.groupby('channel_title')['views'].transform('min')

df['channel_title_mean_comments'] = df.groupby('channel_title')['comment_count'].transform('mean')
df['channel_title_max_comments'] = df.groupby('channel_title')['comment_count'].transform('max')
df['channel_title_min_comments'] = df.groupby('channel_title')['comment_count'].transform('min')
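
A small sketch of how `groupby(...).transform(...)` behaves, on a hypothetical toy frame: unlike `agg`, it returns one value per original row, so the result can be assigned straight back as a new column.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'channel_title': ['a', 'a', 'b'],
    'country_code': ['IN', 'US', 'IN'],
    'views': [1.0, 3.0, 5.0],
})

# Each row receives the statistic of its own channel
toy['channel_title_mean_views'] = toy.groupby('channel_title')['views'].transform('mean')
toy['channel_title_max_views'] = toy.groupby('channel_title')['views'].transform('max')
toy['channel_in_n_countries'] = toy.groupby('channel_title')['country_code'].transform('nunique')

print(toy)
```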

In [None]:
train_proc, test_proc, features = split_df_and_get_features(df, train.shape[0])
features = [c for c in df.columns if c not in [ID_COL, TARGET_COL]]
cat_num_cols = [c for c in features if c not in ['title', 'tags', 'description', 'publish_date']]

In [None]:
clf = CatBoostRegressor(n_estimators = 3000,
                       learning_rate = 0.05,
                       rsm = 0.4, ## Analogous to colsample_bytree
                       random_state=420,
                       )

fit_params = {'verbose': 200, 'early_stopping_rounds': 200}

cb_oofs, cb_preds, fi = run_gradient_boosting(clf, fit_params, train_proc, test_proc, cat_num_cols)

In [None]:
cb_preds_t = np.expm1(cb_preds)
download_preds(cb_preds_t, file_name = 'catboost_5_folds.csv')

**Feature Engineering for text data**

In [None]:
df['title_len'] = df['title'].apply(lambda x: len(x))
df['description_len'] = df['description'].apply(lambda x: len(x))
df['tags_len'] = df['tags'].apply(lambda x: len(x))

In [None]:
train_proc, test_proc, features = split_df_and_get_features(df, train.shape[0])
features = [c for c in df.columns if c not in [ID_COL, TARGET_COL]]
cat_num_cols = [c for c in features if c not in ['title', 'tags', 'description', 'publish_date']]

In [None]:
clf = CatBoostRegressor(n_estimators = 3000,
                       learning_rate = 0.05,
                       rsm = 0.4, ## Analogous to colsample_bytree
                       random_state=420,
                       )

fit_params = {'verbose': 200, 'early_stopping_rounds': 200}

cb_oofs, cb_preds, fi = run_gradient_boosting(clf, fit_params, train_proc, test_proc, cat_num_cols)

**Bag of Words Approach for Text Based Features**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
?CountVectorizer

In [None]:
TOP_N_WORDS = 50

vec = CountVectorizer(max_features = TOP_N_WORDS)
txt_to_fts = vec.fit_transform(df['description']).toarray()
txt_to_fts.shape
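
To show what the resulting matrix contains, here is a simplified standard-library stand-in for the bag-of-words step. It is not `CountVectorizer` itself: it lowercases, keeps tokens of two or more word characters, keeps the `TOP_N` most frequent words across the corpus, and counts them per document. The documents are hypothetical.

```python
import re
from collections import Counter

# Hypothetical descriptions
docs = [
    "Subscribe to the channel",
    "Official music video",
    "Subscribe for more music",
]

token_re = re.compile(r"\b\w\w+\b")   # tokens of at least two word characters
tokenized = [token_re.findall(d.lower()) for d in docs]

# Keep only the most frequent words across the whole corpus
TOP_N = 2
corpus_counts = Counter(tok for doc in tokenized for tok in doc)
vocab = sorted(word for word, _ in corpus_counts.most_common(TOP_N))

# One row per document, one column per vocabulary word
matrix = [[doc.count(word) for word in vocab] for doc in tokenized]

print(vocab)
print(matrix)
```

Each column of the matrix is the count of one frequent word, which is why the notebook can attach the 50 columns to `df` as ordinary numeric features.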

In [None]:
c = 'description'
txt_fts_names = [c + f'_word_{i}_count' for i in range(TOP_N_WORDS)]
df[txt_fts_names] = txt_to_fts

train_proc, test_proc, features = split_df_and_get_features(df, train.shape[0])
features = [c for c in df.columns if c not in [ID_COL, TARGET_COL]]
cat_num_cols = [c for c in features if c not in ['title', 'tags', 'description', 'publish_date']]

In [None]:
cat_num_cols

In [None]:
clf = CatBoostRegressor(n_estimators = 4000,
                       learning_rate = 0.06,
                       rsm = 0.4, ## Analogous to colsample_bytree
                       random_state=4200,
                       )

fit_params = {'verbose': 300, 'early_stopping_rounds': 200}

cb_oofs, cb_preds, fi = run_gradient_boosting(clf, fit_params, train_proc, test_proc, cat_num_cols) 

In [None]:
cb_preds_t = np.expm1(cb_preds)
download_preds(cb_preds_t, file_name = 'catboost_text_cols_bow.csv')

In [None]:
clf = LGBMRegressor(n_estimators = 4000,
                        learning_rate = 0.04,
                        colsample_bytree = 0.65,
                        metric = 'None',
                        )
fit_params = {'verbose': 200, 'early_stopping_rounds': 200, 'eval_metric': 'rmse'}

lgb_oofs, lgb_preds, fi = run_gradient_boosting(clf, fit_params, train_proc, test_proc, cat_num_cols)

In [None]:
lgb_preds_t = np.expm1(lgb_preds)
download_preds(lgb_preds_t, file_name = 'lgb_text_cols_bow.csv')

In [None]:
clf = XGBRegressor(n_estimators = 3000,
                    max_depth = 7,
                    learning_rate = 0.05,
                    colsample_bytree = 0.5,
                    random_state=4200,
                    )

fit_params = {'verbose': 200, 'early_stopping_rounds': 200}

xgb_oofs, xgb_preds, fi = run_gradient_boosting(clf, fit_params, train_proc, test_proc, cat_num_cols)

In [None]:
xgb_preds_t = np.expm1(xgb_preds)

download_preds(xgb_preds_t, file_name = 'xgb_text_cols_bow.csv')

**Ensembling**

In [None]:
av_metric(np.log1p(train[TARGET_COL]), lgb_oofs * 0.7 + cb_oofs * 0.3)
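
The 0.7 / 0.3 weights above are one possible blend. A hedged sketch of how such weights could be chosen by scanning candidate values on out-of-fold predictions is shown below; the two "models" are hypothetical arrays constructed so that their errors partly cancel.

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Hypothetical log-scale target and two sets of OOF predictions with opposite errors
y = np.array([1.0, 2.0, 3.0, 4.0])
oof_a = y + np.array([0.2, -0.2, 0.2, -0.2])
oof_b = y + np.array([-0.3, 0.3, -0.3, 0.3])

# Scan blend weights w for model A (model B gets 1 - w)
weights = np.linspace(0, 1, 11)
scores = [rmse(y, w * oof_a + (1 - w) * oof_b) for w in weights]

best_w = weights[int(np.argmin(scores))]
best_score = min(scores)
print(best_w, best_score)
```

Weights tuned this way are fitted to the OOF predictions, so the blended OOF score can be slightly optimistic.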

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning when new columns are added below
train_new = train[[ID_COL, TARGET_COL]].copy()
train_new[TARGET_COL] = np.log1p(train_new[TARGET_COL])

test_new = test[[ID_COL]].copy()

train_new['lgb'] = lgb_oofs
test_new['lgb'] = lgb_preds

train_new['cb'] = cb_oofs
test_new['cb'] = cb_preds

train_new['xgb'] = xgb_oofs
test_new['xgb'] = xgb_preds

features = [c for c in train_new.columns if c not in [ID_COL, TARGET_COL]]

In [None]:
clf = LinearRegression()

ens_oofs, ens_preds = run_clf_kfold(clf, train_new, test_new, features)
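
This step is a form of stacking: a linear model learns how to weight the base models' out-of-fold predictions. The sketch below reproduces the idea with `np.linalg.lstsq` on hypothetical synthetic predictions. Unlike the notebook, it fits and scores on the same rows, so it only illustrates the mechanics and not the cross-validated score.

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Hypothetical target and two noisy base-model OOF predictions
rng = np.random.default_rng(0)
y = rng.normal(size=200)
oof_a = y + rng.normal(scale=0.3, size=200)
oof_b = y + rng.normal(scale=0.5, size=200)

# Design matrix: intercept plus one column per base model
X = np.column_stack([np.ones_like(y), oof_a, oof_b])

# Least-squares fit, the same objective LinearRegression minimises
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
stacked = X @ coef

print('coefficients:', coef.round(3))
print('rmse a, b, stacked:', rmse(y, oof_a), rmse(y, oof_b), rmse(y, stacked))
```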

In [None]:
ens_preds_t = np.expm1(ens_preds)
download_preds(ens_preds_t, file_name = 'hacklive_ensemble_final.csv')

In [None]:
pd.read_csv('hacklive_ensemble_final.csv')